Predictive analytics: Poor outcome
Thoughts
Questions
- What deficit are we talking about? Is it covered by the mRS?
- Should the pre-treatment mRS be in the model, or is it already captured by the other variables?
Backlog
- Honest estimation
- G estimation (using only significant ones to get accurate ORs)
- Make sure that the number of outcomes (26) is valid.
- Bootstrap the whole process - how many models include each of the features?
- The SuSiE method: https://www.biorxiv.org/content/10.1101/501114v1
- Choose the best model with forward/backward stepwise selection
- Feature selection with honest estimation + p-value-based + bootstrap
- Feature selection with honest estimation + LASSO-based + bootstrap
- Use PCA to create uncorrelated predictors instead of dropping variables
- Bayesian fitting
- Change to use tidybayes and tidymodels
- Use neural networks to fit
- Consider a time-to-event analysis
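The bootstrap items above can be sketched in base R on simulated data. This is a minimal sketch only: the variable names `x1`..`x5`, the 0.05 cutoff, and 100 resamples are illustrative assumptions, not the project's settings.

```r
# Sketch: how often does each predictor reach p < 0.05 across bootstrap resamples?
set.seed(1)
n <- 200
X <- as.data.frame(matrix(rnorm(n * 5), n, 5))
names(X) <- paste0("x", 1:5)
# Only x1 and x2 truly drive the simulated outcome
y <- rbinom(n, 1, plogis(1.5 * X$x1 - 1.5 * X$x2))
dat <- cbind(X, y = y)

# Refit the logistic model on each resample and record which predictors
# reach p < 0.05
n_boot <- 100
selected <- replicate(n_boot, {
  idx <- sample(nrow(dat), replace = TRUE)
  fit <- glm(y ~ ., data = dat[idx, ], family = binomial())
  p <- summary(fit)$coefficients[-1, "Pr(>|z|)"]
  p < 0.05
})
# Inclusion frequency per predictor across resamples
rowMeans(selected)
```

The signal variables should appear in nearly every resample, the noise variables only rarely; the inclusion frequency is the stability measure the backlog item asks for.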
Setup
Imports
# Load packages
library(glmnet)
library(magrittr)
library(selectiveInference)
library(tidyverse)
# Data
# https://office365stanford-my.sharepoint.com/:x:/r/personal/simonlev_stanford_edu/_layouts/15/Doc.aspx?sourcedoc=%7BCE7DAF19-8F1A-43B0-866D-3D0F36336EDC%7D&file=Peds%20AVM%20Database%20July%2010%202024%20Onwards.xlsx&fromShare=true&action=default&mobileredirect=true
# Source code
Configurations
# Source outcome-specific configs
source(params$CONFIG)
# Destination locations
DST_DIRNAME <- "../../outputs/"
# Source locations
SRC_DIRNAME <- "../../data/3_tidy/patients"
SRC_BASENAME <- "patients-daily_2024-07-21.csv"
# Should the Dockerfile be automatically updated?
UPDATE_DOCKERFILE <- FALSE
# Set the seed
set.seed(1891)
Parameters
EXPOSURES_CONTINUOUS <- c(
"Age at 1st treatment (years)" = "age_at_first_treatment_yrs",
"mRS (presentation)" = "modified_rankin_score_presentation",
"mRS (pre-treatment)" = "modified_rankin_score_pretreatment",
"mRS (1-week post-op)" = "modified_rankin_score_postop_within_1_week",
"mRS (final)" = "modified_rankin_score_final",
"Nidus size (cm)" = "max_size_cm",
"Spetzler-Martin grade" = "spetzler_martin_grade",
"Lawton-Young grade" = "lawton_young_grade",
"Size score" = "size_score"
)
EXPOSURES_BINARY <- c(
"Sex (Male)" = "is_male",
"Involves eloquent location" = "is_eloquent_location",
"Has deep venous drainage" = "has_deep_venous_drainage",
"Diffuse nidus" = "is_diffuse_nidus",
"Hemorrhage at presentation" = "has_hemorrhage",
"Seizures at presentation" = "has_seizures",
"Deficit at presentation" = "has_deficit",
"Paresis at presentation" = "has_paresis",
"Presence of aneurysms" = "has_associated_aneurysm",
"Spetzler-Martin grade < 4" = "is_spetzler_martin_grade_less_than_4",
"Had surgery" = "is_surgery"
)
EXPOSURES_CATEGORICAL <- c(
"Location" = "location",
"Venous drainage" = "venous_drainage",
"Modality of treatment" = "procedure_combinations"
)
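These vectors are named so that each display label travels with its column name. A toy lookup sketch (`exposures` here is a hypothetical two-element subset, and `label_for` is an illustrative helper, not part of the analysis):

```r
# Toy version of the label lookup implied by the named vectors above:
# names() hold display labels, values hold column names.
exposures <- c(
  "Nidus size (cm)" = "max_size_cm",
  "Sex (Male)"      = "is_male"
)
# Column name -> display label
label_for <- function(col) names(exposures)[match(col, exposures)]
label_for("max_size_cm")
# -> "Nidus size (cm)"
```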
OUTCOMES <- c(
"Poor outcome (mRS >= 3)" = "is_poor_outcome",
"Obliteration" = "is_obliterated",
"Complications - minor" = "has_complication_minor",
"Complications - major" = "has_complication_major",
"mRS change (final - pre-treatment)" =
"modified_rankin_score_final_minus_pretx",
"mRS change (final - presentation)" =
"modified_rankin_score_final_minus_presentation",
"mRS change direction (Final - Pre-tx)" = "change_in_mrs_tx_vs_final"
)
Functions
R utils.
Data analysis utils.
Read
# File path
filepath <- file.path(SRC_DIRNAME, SRC_BASENAME)
# Read
patients_ <-
read_csv(filepath) %>%
dplyr::sample_frac(params$PROPORTION_OF_DATA)
# If analyzing a subset of the data, restrict the dataset
if (params$TITLE == "Poor outcome - Hemorrhage") {
patients_ <- patients_ %>% filter(has_hemorrhage)
}
if (params$TITLE == "Poor outcome - No Hemorrhage") {
patients_ <- patients_ %>% filter(!has_hemorrhage)
}
Conform
Setup
Remove all empty rows.
Create two dataframes - one for univariable and one for multivariable analysis. This is because variables for multivariable analysis will be recoded to reduce the number of levels to avoid overfitting the model.
Recode columns
For venous_drainage if “Both”, mark as “Deep.”
df_multi <-
df_multi %>%
mutate(
venous_drainage = ifelse(venous_drainage == "Both", "Deep", venous_drainage)
)
For procedure_combinations, recode any multi-procedure combination (more than one character) as "Multimodal" to reduce the number of levels.
df_multi <-
df_multi %>%
mutate(procedure_combinations = case_when(
nchar(procedure_combinations) > 1 ~ "Multimodal",
.default = procedure_combinations
))
For procedure_combinations, flag whether the treatment was surgery-based.
df_multi <-
df_multi %>%
mutate(
is_surgery = str_detect(procedure_combinations, "S")
) %>%
relocate(is_surgery, .after = procedure_combinations)
df_uni <-
df_uni %>%
mutate(
is_surgery = str_detect(procedure_combinations, "S")
) %>%
relocate(is_surgery, .after = procedure_combinations)
For location, collapse rare levels into "Other".
df_multi <-
df_multi %>%
mutate(
location = ifelse(location == "Corpus Callosum", "Other", location),
location = ifelse(location == "Cerebellum", "Other", location),
# location = ifelse(location == "Deep", "Other", location),
location = factor(location),
location = relevel(location, ref = "Other")
)
For spetzler_martin_grade, create a binary variant (grades 1-3 vs. 4-5).
# For multivariable analysis
df_multi <-
df_multi %>%
mutate(
is_spetzler_martin_grade_less_than_4 = spetzler_martin_grade < 4
) %>%
relocate(is_spetzler_martin_grade_less_than_4, .after = spetzler_martin_grade)
# For univariable analysis
df_uni <-
df_uni %>%
mutate(
is_spetzler_martin_grade_less_than_4 = spetzler_martin_grade < 4
) %>%
relocate(is_spetzler_martin_grade_less_than_4, .after = spetzler_martin_grade)
Missingness
# Get cols
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL,
OUTCOMES
))
# Count missingness
df_multi %>%
select(all_of(cols)) %>%
summarise_all(~ sum(is.na(.))) %>%
pivot_longer(everything(), values_to = "num_missing") %>%
arrange(desc(num_missing)) %>%
filter(num_missing > 0) %>%
sable()
| name | num_missing |
|---|---|
| lawton_young_grade | 10 |
| spetzler_martin_grade | 8 |
| is_spetzler_martin_grade_less_than_4 | 8 |
| size_score | 6 |
| has_associated_aneurysm | 6 |
| max_size_cm | 5 |
| is_eloquent_location | 5 |
| has_deep_venous_drainage | 4 |
| modified_rankin_score_pretreatment | 3 |
| modified_rankin_score_postop_within_1_week | 3 |
| is_surgery | 3 |
| venous_drainage | 3 |
| procedure_combinations | 3 |
| modified_rankin_score_final_minus_pretx | 3 |
| has_complication_minor | 2 |
| has_complication_major | 2 |
| modified_rankin_score_final | 1 |
| is_male | 1 |
| is_diffuse_nidus | 1 |
| location | 1 |
| is_poor_outcome | 1 |
| modified_rankin_score_final_minus_presentation | 1 |
| change_in_mrs_tx_vs_final | 1 |
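Listwise deletion compounds across columns, so the effective sample size for a multivariable model is smaller than any single count above. A toy base-R check of the surviving sample size (column names illustrative, not the study data):

```r
# How many rows survive complete-case analysis across all model columns?
dat <- data.frame(
  age  = c(10, 12, NA, 15),
  size = c(3.1, NA, 2.5, 4.0),
  poor = c(TRUE, FALSE, FALSE, NA)
)
sum(complete.cases(dat))  # rows usable by a model needing all three columns
# -> 1
```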
Which eligible patients are missing each variable?
df_multi %>%
filter(is_eligible) %>%
select(mrn, all_of(cols)) %>%
mutate(across(-mrn, is.na)) %>%
pivot_longer(
cols = -mrn,
names_to = "name",
values_to = "is_missing"
) %>%
filter(is_missing) %>%
group_by(name) %>%
summarize(mrn = str_c(mrn, collapse = ", ")) %>%
sable()
| name | mrn |
|---|---|
| change_in_mrs_tx_vs_final | 60860434 |
| has_associated_aneurysm | 60233426, 60310463, 60303054, 61011961, 62696083, 62609300 |
| has_complication_major | 60860434, 62023254 |
| has_complication_minor | 60860434, 62023254 |
| has_deep_venous_drainage | 60233426, 62609300, 42446310, 81981102 |
| is_diffuse_nidus | 15871379 |
| is_eloquent_location | 62728597, 62696083, 62609300, 42446310, 81981102 |
| is_male | 19399469 |
| is_poor_outcome | 60860434 |
| is_spetzler_martin_grade_less_than_4 | 60233426, 60303054, 60860434, 62728597, 62696083, 62609300, 42446310, 81981102 |
| is_surgery | 49659493, 46166047, 62609300 |
| lawton_young_grade | 60233426, 60310463, 60303054, 60860434, 61011961, 62728597, 62696083, 62609300, 42446310, 81981102 |
| location | 60310463 |
| max_size_cm | 60233426, 60303054, 60860434, 62609300, 81981102 |
| modified_rankin_score_final | 60860434 |
| modified_rankin_score_final_minus_presentation | 60860434 |
| modified_rankin_score_final_minus_pretx | 60860434, 49659493, 46166047 |
| modified_rankin_score_postop_within_1_week | 60860434, 49659493, 46166047 |
| modified_rankin_score_pretreatment | 60860434, 49659493, 46166047 |
| procedure_combinations | 49659493, 46166047, 62609300 |
| size_score | 60233426, 60303054, 60860434, 62609300, 42446310, 81981102 |
| spetzler_martin_grade | 60233426, 60303054, 60860434, 62728597, 62696083, 62609300, 42446310, 81981102 |
| venous_drainage | 60233426, 62609300, 81981102 |
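The ID-collapsing step above has a compact base-R analogue. A sketch on toy data (IDs and column names are made up):

```r
# For each variable, collapse the IDs of rows where it is missing
dat <- data.frame(
  id = c("A", "B", "C"),
  x  = c(NA, 1, NA),
  y  = c(2, NA, 3)
)
miss <- lapply(dat[-1], function(col) paste(dat$id[is.na(col)], collapse = ", "))
# miss$x is "A, C"; miss$y is "B"
```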
Visualizations
Overall
Distribution of values of the exposures across levels of the outcome.
# Define the predictors of interest
cols <- c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
CHOSEN_OUTCOME
)
# Reshape the data
df_long <-
df_uni %>%
select(all_of(cols)) %>%
pivot_longer(-all_of(CHOSEN_OUTCOME), names_to = "predictor", values_to = "value")
# Plot the box plots
df_long %>%
# aes_string() is deprecated; use the .data pronoun instead
ggplot(aes(
x = .data[[CHOSEN_OUTCOME]],
y = value,
fill = .data[[CHOSEN_OUTCOME]],
color = .data[[CHOSEN_OUTCOME]]
)) +
geom_boxplot(alpha = 0.5) +
facet_wrap(~predictor, scales = "free") +
labs(x = "Outcome", y = "Value", fill = "Outcome") +
theme_minimal() +
theme(
axis.text = element_text(size = 10),
axis.title = element_text(size = 11),
strip.text = element_text(size = 11),
legend.position = "none"
)
By hemorrhage
# Define the predictors of interest
cols <- c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
CHOSEN_OUTCOME
)
# Reshape the data
df_long <-
df_uni %>%
select(all_of(cols)) %>%
pivot_longer(
cols = -c(all_of(CHOSEN_OUTCOME), `Hemorrhage at presentation`),
names_to = "predictor",
values_to = "value"
)
# Plot the box plots
df_long %>%
# aes_string() is deprecated; use the .data pronoun instead
ggplot(aes(
x = .data[[CHOSEN_OUTCOME]],
y = value,
fill = `Hemorrhage at presentation`,
color = `Hemorrhage at presentation`
)) +
geom_boxplot(alpha = 0.5) +
facet_wrap(~predictor, scales = "free") +
labs(x = "Outcome", y = "Value", color = "Hemorrhage at presentation") +
theme_minimal() +
guides(fill = "none") +
theme(
axis.text = element_text(size = 10),
axis.title = element_text(size = 11),
strip.text = element_text(size = 11)
)
Descriptive statistics
Setup
Descriptive statistics - continuous variables.
compute_descriptive_continuous <- function(
df = df_uni,
cols = c(EXPOSURES_CONTINUOUS),
use_interaction = TRUE,
interaction_col = "has_hemorrhage",
interaction_col_name = "Hemorrhage at presentation",
interaction_col_value = TRUE,
prefix = "Haem"
) {
# Apply desired formatting to numbers
format_numbers <- function(x) {
sprintf("%.1f", x)
}
# Compute new dataset
df <-
df %>%
select(all_of(cols), CHOSEN_OUTCOME, interaction_col) %>%
drop_na(CHOSEN_OUTCOME, interaction_col)
# Apply interaction
if (use_interaction) {
df <-
df %>%
filter(!!sym(interaction_col) == interaction_col_value)
}
# Compute descriptive statistics
desc_stats <-
df %>%
pivot_longer(where(is.numeric), names_to = "keys", values_to = "values") %>%
group_by(!!sym(CHOSEN_OUTCOME), keys) %>%
summarize(
num_total = n(),
num_missing = sum(is.na(values)),
mean = mean(values, na.rm = TRUE),
sd = sd(values, na.rm = TRUE),
min = quantile(values, 0, na.rm = TRUE),
max = quantile(values, 1, na.rm = TRUE),
q_50 = quantile(values, 0.50, na.rm = TRUE),
q_25 = quantile(values, 0.25, na.rm = TRUE),
q_75 = quantile(values, 0.75, na.rm = TRUE)
) %>%
ungroup() %>%
mutate(
stats = glue::glue("{format_numbers(mean)} ({format_numbers(sd)})"),
!!sym(CHOSEN_OUTCOME) := factor(
!!sym(CHOSEN_OUTCOME),
levels = CHOSEN_OUTCOME_LEVELS,
labels = CHOSEN_OUTCOME_LABELS
)
) %>%
pivot_wider(
id_cols = keys,
names_from = !!sym(CHOSEN_OUTCOME),
values_from = stats
) %>%
relocate(CHOSEN_OUTCOME_LABELS[2], .after = CHOSEN_OUTCOME_LABELS[1])
# Compute p-values
pvals <-
df %>%
pivot_longer(where(is.numeric), names_to = "keys", values_to = "values") %>%
group_by(keys) %>%
summarize(
pvalue = wilcox.test(values ~ !!sym(CHOSEN_OUTCOME))$p.value
)
# Synthesize
out <-
desc_stats %>%
left_join(pvals)
# Add sample size to column names
num_with_outcome <- sum(df %>% pull(CHOSEN_OUTCOME))
num_no_outcome <- sum(!df %>% pull(CHOSEN_OUTCOME))
# Names must follow the column order set by relocate() above
# (outcome label 1, then outcome label 2), or the labels get swapped
new_col_names <- c(
glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[1]} (n={num_no_outcome})"),
glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[2]} (n={num_with_outcome})"),
glue::glue("{prefix} - P-value")
)
out <-
out %>%
rename_with(
.cols = !!sym(CHOSEN_OUTCOME_LABELS[1]):pvalue, ~ new_col_names
) %>%
mutate(
keys = ifelse(keys == interaction_col, interaction_col_name, keys)
)
return(out)
}
Descriptive statistics - binary variables.
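Both descriptive functions render each table cell as a formatted string ("mean (sd)" for continuous, "count (%)" for binary). A miniature base-R sketch of that pattern (values illustrative):

```r
# Miniature of the "mean (sd)" cell format used by the descriptive functions
format_mean_sd <- function(m, s) sprintf("%.1f (%.1f)", m, s)
format_mean_sd(11.33, 3.84)
# -> "11.3 (3.8)"
```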
compute_descriptive_binary <- function(
df = df_uni,
cols = c(EXPOSURES_BINARY),
use_interaction = TRUE,
interaction_col = "has_hemorrhage",
interaction_col_name = "Hemorrhage at presentation",
interaction_col_value = TRUE,
prefix = "Haem"
) {
# Define arguments
cols <- c(cols[!cols == interaction_col])
# Apply desired formatting to numbers
format_numbers <- function(x, decimals = 0) {
decimals <- paste0("%.", decimals, "f")
sprintf(decimals, x)
}
# Compute new dataset
df <-
df %>%
select(all_of(cols), CHOSEN_OUTCOME, interaction_col) %>%
drop_na(CHOSEN_OUTCOME, interaction_col)
# Apply interaction
if (use_interaction) {
df <-
df %>%
filter(!!sym(interaction_col) == interaction_col_value)
}
# Compute descriptive statistics
desc_stats <-
df %>%
pivot_longer(
cols = -c(CHOSEN_OUTCOME),
names_to = "keys",
values_to = "values"
) %>%
group_by(!!sym(CHOSEN_OUTCOME), keys) %>%
summarize(
num_total = n(),
num_missing = sum(is.na(values)),
num_with = sum(values, na.rm = TRUE),
num_without = sum(!values, na.rm = TRUE),
pct_with = mean(values, na.rm = TRUE)
) %>%
ungroup() %>%
mutate(
stats = glue::glue("{num_with} ({format_numbers(pct_with*100)}%)"),
!!sym(CHOSEN_OUTCOME) := factor(
!!sym(CHOSEN_OUTCOME),
levels = CHOSEN_OUTCOME_LEVELS,
labels = CHOSEN_OUTCOME_LABELS
)
) %>%
pivot_wider(
id_cols = keys,
names_from = !!sym(CHOSEN_OUTCOME),
values_from = stats
) %>%
relocate(CHOSEN_OUTCOME_LABELS[1], .after = CHOSEN_OUTCOME_LABELS[2])
# Compute p-values
pvals <-
df %>%
pivot_longer(
cols = -c(CHOSEN_OUTCOME),
names_to = "keys",
values_to = "values"
) %>%
group_by(keys) %>%
summarize(
pvalue = kruskal.test(values ~ !!sym(CHOSEN_OUTCOME))$p.value
)
# Synthesize
out <-
desc_stats %>%
left_join(pvals)
# Prettify
num_with_outcome <- sum(df %>% pull(CHOSEN_OUTCOME))
num_no_outcome <- sum(!df %>% pull(CHOSEN_OUTCOME))
new_col_names <- c(
glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[2]} (n={num_with_outcome})"),
glue::glue("{prefix} - {CHOSEN_OUTCOME_LABELS[1]} (n={num_no_outcome})"),
glue::glue("{prefix} - P-value")
)
out <-
out %>%
rename_with(
.cols = !!sym(CHOSEN_OUTCOME_LABELS[2]):pvalue, ~new_col_names
) %>%
mutate(
keys = ifelse(keys == interaction_col, interaction_col_name, keys)
)
return(out)
}
Non-specific
# Rename dataframe
`Pediatric AVMs` <-
df_uni %>%
filter(is_eligible) %>%
select(-is_eligible, -comments)
# Create summary
`Pediatric AVMs` %>%
summarytools::dfSummary(display.labels = FALSE) %>%
print(
file =
"../../outputs/descriptive-statistics/descriptive_statistics_eligible.html",
footnote = NA
)
# Remove unwanted dataframe
remove(`Pediatric AVMs`)
By hemorrhage - table1 (WIP)
https://cran.r-project.org/web/packages/table1/vignettes/table1-examples.html
|  | FALSE (N=92) | TRUE (N=144) | Overall (N=236) |
|---|---|---|---|
| factor(is_male) | |||
| FALSE | 43 (46.7%) | 67 (46.5%) | 110 (46.6%) |
| TRUE | 48 (52.2%) | 77 (53.5%) | 125 (53.0%) |
| Missing | 1 (1.1%) | 0 (0%) | 1 (0.4%) |
| age_at_first_treatment_yrs | |||
| Mean (SD) | 11.8 (4.25) | 10.9 (3.52) | 11.3 (3.84) |
| Median [Min, Max] | 13.0 [0, 18.0] | 11.0 [2.00, 18.0] | 11.0 [0, 18.0] |
| factor(is_eloquent_location) | |||
| FALSE | 28 (30.4%) | 61 (42.4%) | 89 (37.7%) |
| TRUE | 64 (69.6%) | 78 (54.2%) | 142 (60.2%) |
| Missing | 0 (0%) | 5 (3.5%) | 5 (2.1%) |
| max_size_cm | |||
| Mean (SD) | 4.16 (2.48) | 2.95 (2.05) | 3.43 (2.30) |
| Median [Min, Max] | 3.50 [0.700, 16.0] | 2.50 [0.400, 12.5] | 3.00 [0.400, 16.0] |
| Missing | 1 (1.1%) | 4 (2.8%) | 5 (2.1%) |
|  | Overall - Good outcome (N=208) | Overall - Poor outcome (N=27) | No Hemorrhage - Good outcome (N=81) | No Hemorrhage - Poor outcome (N=11) | Has Hemorrhage - Good outcome (N=127) | Has Hemorrhage - Poor outcome (N=16) |
|---|---|---|---|---|---|---|
| Age at 1st treatment (years) | 11 (± 3.7) | 9.7 (± 4.3) | 12 (± 4.2) | 10 (± 4.6) | 11 (± 3.4) | 9.4 (± 4.2) |
| Median (IQR) | 12 (9.0 - 15) | 9.0 (6.0 - 13) | 13 (9.0 - 15) | 10 (6.5 - 15) | 11 (9.0 - 14) | 9.0 (5.8 - 12) |
| mRS (presentation) | 2.2 (± 1.8) | 3.0 (± 1.7) | 0.95 (± 0.92) | 1.9 (± 1.1) | 3.0 (± 1.7) | 3.8 (± 1.5) |
| Median (IQR) | 2.0 (1.0 - 4.0) | 3.0 (1.0 - 4.5) | 1.0 (0 - 1.0) | 1.0 (1.0 - 3.0) | 3.0 (1.0 - 5.0) | 4.0 (3.8 - 5.0) |
| mRS (pre-treatment) | 1.5 (± 1.4) | 2.7 (± 1.6) | 0.84 (± 0.78) | 1.8 (± 1.3) | 1.9 (± 1.5) | 3.4 (± 1.5) |
| Median (IQR) | 1.0 (1.0 - 2.0) | 3.0 (1.0 - 4.0) | 1.0 (0 - 1.0) | 1.0 (1.0 - 3.0) | 1.0 (1.0 - 3.0) | 3.5 (2.8 - 5.0) |
| Missing | 2 (1.0%) | 0 (0%) | 0 (0%) | 0 (0%) | 2 (1.6%) | 0 (0%) |
| mRS (1-week post-op) | 1.4 (± 1.2) | 3.6 (± 1.1) | 1.0 (± 0.99) | 2.9 (± 1.1) | 1.7 (± 1.2) | 4.1 (± 0.77) |
| Median (IQR) | 1.0 (1.0 - 2.0) | 4.0 (3.0 - 4.0) | 1.0 (0 - 1.0) | 3.0 (3.0 - 3.0) | 1.0 (1.0 - 2.0) | 4.0 (3.8 - 5.0) |
| Missing | 2 (1.0%) | 0 (0%) | 0 (0%) | 0 (0%) | 2 (1.6%) | 0 (0%) |
| mRS (final) | 0.64 (± 0.69) | 3.9 (± 1.1) | 0.56 (± 0.69) | 3.6 (± 1.2) | 0.70 (± 0.68) | 4.0 (± 1.0) |
| Median (IQR) | 1.0 (0 - 1.0) | 3.0 (3.0 - 4.0) | 0 (0 - 1.0) | 3.0 (3.0 - 3.5) | 1.0 (0 - 1.0) | 4.0 (3.0 - 4.3) |
| Nidus size (cm) | 3.3 (± 2.1) | 4.8 (± 3.1) | 3.9 (± 2.4) | 6.1 (± 2.6) | 2.8 (± 1.9) | 3.9 (± 3.2) |
| Median (IQR) | 3.0 (2.0 - 4.0) | 4.0 (2.1 - 7.0) | 3.5 (2.5 - 4.6) | 6.3 (4.5 - 7.7) | 2.5 (1.5 - 3.6) | 3.0 (1.9 - 4.5) |
| Missing | 3 (1.4%) | 1 (3.7%) | 1 (1.2%) | 0 (0%) | 2 (1.6%) | 1 (6.3%) |
| Spetzler-Martin grade | 2.8 (± 1.2) | 3.6 (± 1.3) | 3.0 (± 1.2) | 4.1 (± 1.0) | 2.6 (± 1.1) | 3.3 (± 1.3) |
| Median (IQR) | 3.0 (2.0 - 4.0) | 4.0 (3.0 - 5.0) | 3.0 (2.0 - 4.0) | 4.0 (3.5 - 5.0) | 3.0 (2.0 - 3.0) | 3.0 (2.5 - 4.0) |
| Missing | 6 (2.9%) | 1 (3.7%) | 1 (1.2%) | 0 (0%) | 5 (3.9%) | 1 (6.3%) |
| Lawton-Young grade | 4.3 (± 1.5) | 5.3 (± 1.8) | 5.0 (± 1.5) | 6.3 (± 1.5) | 3.8 (± 1.4) | 4.5 (± 1.6) |
| Median (IQR) | 4.0 (3.0 - 5.0) | 5.0 (4.0 - 6.8) | 5.0 (4.0 - 6.0) | 6.0 (5.5 - 7.5) | 4.0 (3.0 - 4.8) | 5.0 (3.5 - 5.5) |
| Missing | 8 (3.8%) | 1 (3.7%) | 3 (3.7%) | 0 (0%) | 5 (3.9%) | 1 (6.3%) |
| Size score | 1.6 (± 0.71) | 2.0 (± 0.82) | 1.9 (± 0.74) | 2.5 (± 0.69) | 1.5 (± 0.64) | 1.7 (± 0.80) |
| Median (IQR) | 1.5 (1.0 - 2.0) | 2.0 (1.0 - 3.0) | 2.0 (1.0 - 2.0) | 3.0 (2.0 - 3.0) | 1.0 (1.0 - 2.0) | 2.0 (1.0 - 2.0) |
| Missing | 4 (1.9%) | 1 (3.7%) | 1 (1.2%) | 0 (0%) | 3 (2.4%) | 1 (6.3%) |
| Sex (Male) | ||||||
| Yes | 109 (52.4%) | 15 (55.6%) | 40 (49.4%) | 8 (72.7%) | 69 (54.3%) | 7 (43.8%) |
| No | 98 (47.1%) | 12 (44.4%) | 40 (49.4%) | 3 (27.3%) | 58 (45.7%) | 9 (56.3%) |
| Missing | 1 (0.5%) | 0 (0%) | 1 (1.2%) | 0 (0%) | 0 (0%) | 0 (0%) |
| Involves eloquent location | ||||||
| Yes | 118 (56.7%) | 23 (85.2%) | 53 (65.4%) | 11 (100%) | 65 (51.2%) | 12 (75.0%) |
| No | 86 (41.3%) | 3 (11.1%) | 28 (34.6%) | 0 (0%) | 58 (45.7%) | 3 (18.8%) |
| Missing | 4 (1.9%) | 1 (3.7%) | 0 (0%) | 0 (0%) | 4 (3.1%) | 1 (6.3%) |
| Has deep venous drainage | ||||||
| Yes | 110 (52.9%) | 17 (63.0%) | 35 (43.2%) | 8 (72.7%) | 75 (59.1%) | 9 (56.3%) |
| No | 95 (45.7%) | 9 (33.3%) | 45 (55.6%) | 3 (27.3%) | 50 (39.4%) | 6 (37.5%) |
| Missing | 3 (1.4%) | 1 (3.7%) | 1 (1.2%) | 0 (0%) | 2 (1.6%) | 1 (6.3%) |
| Diffuse nidus | ||||||
| Yes | 22 (10.6%) | 9 (33.3%) | 10 (12.3%) | 5 (45.5%) | 12 (9.4%) | 4 (25.0%) |
| No | 185 (88.9%) | 18 (66.7%) | 70 (86.4%) | 6 (54.5%) | 115 (90.6%) | 12 (75.0%) |
| Missing | 1 (0.5%) | 0 (0%) | 1 (1.2%) | 0 (0%) | 0 (0%) | 0 (0%) |
| Seizures at presentation | ||||||
| Yes | 52 (25.0%) | 7 (25.9%) | 22 (27.2%) | 5 (45.5%) | 30 (23.6%) | 2 (12.5%) |
| No | 156 (75.0%) | 20 (74.1%) | 59 (72.8%) | 6 (54.5%) | 97 (76.4%) | 14 (87.5%) |
| Deficit at presentation | ||||||
| Yes | 72 (34.6%) | 15 (55.6%) | 25 (30.9%) | 9 (81.8%) | 47 (37.0%) | 6 (37.5%) |
| No | 136 (65.4%) | 12 (44.4%) | 56 (69.1%) | 2 (18.2%) | 80 (63.0%) | 10 (62.5%) |
| Paresis at presentation | ||||||
| Yes | 51 (24.5%) | 9 (33.3%) | 15 (18.5%) | 3 (27.3%) | 36 (28.3%) | 6 (37.5%) |
| No | 157 (75.5%) | 18 (66.7%) | 66 (81.5%) | 8 (72.7%) | 91 (71.7%) | 10 (62.5%) |
| Presence of aneurysms | ||||||
| Yes | 40 (19.2%) | 6 (22.2%) | 10 (12.3%) | 2 (18.2%) | 30 (23.6%) | 4 (25.0%) |
| No | 162 (77.9%) | 21 (77.8%) | 68 (84.0%) | 9 (81.8%) | 94 (74.0%) | 12 (75.0%) |
| Missing | 6 (2.9%) | 0 (0%) | 3 (3.7%) | 0 (0%) | 3 (2.4%) | 0 (0%) |
| Spetzler-Martin grade < 4 | ||||||
| Yes | 148 (71.2%) | 11 (40.7%) | 50 (61.7%) | 3 (27.3%) | 98 (77.2%) | 8 (50.0%) |
| No | 54 (26.0%) | 15 (55.6%) | 30 (37.0%) | 8 (72.7%) | 24 (18.9%) | 7 (43.8%) |
| Missing | 6 (2.9%) | 1 (3.7%) | 1 (1.2%) | 0 (0%) | 5 (3.9%) | 1 (6.3%) |
| Had surgery | ||||||
| Yes | 122 (58.7%) | 16 (59.3%) | 38 (46.9%) | 6 (54.5%) | 84 (66.1%) | 10 (62.5%) |
| No | 83 (39.9%) | 11 (40.7%) | 43 (53.1%) | 5 (45.5%) | 40 (31.5%) | 6 (37.5%) |
| Missing | 3 (1.4%) | 0 (0%) | 0 (0%) | 0 (0%) | 3 (2.4%) | 0 (0%) |
By hemorrhage
# Define arguments
interaction_col <- "has_hemorrhage"
interaction_col_name <- "Hemorrhage at presentation"
prefix_true <- "Haem"
prefix_false <- "No Haem"
# Get all
out <- list()
out[[1]] <- compute_descriptive_continuous(
use_interaction = F,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = "All"
)
out[[2]] <- compute_descriptive_binary(
use_interaction = F,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = "All"
)
out_all <- bind_rows(out)
# Get with hemorrhage
out <- list()
out[[1]] <- compute_descriptive_continuous(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = prefix_true
)
out[[2]] <- compute_descriptive_binary(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = prefix_true
)
out_haem <- bind_rows(out)
# Get without hemorrhage
out <- list()
out[[1]] <- compute_descriptive_continuous(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = F,
prefix = prefix_false
)
out[[2]] <- compute_descriptive_binary(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = F,
prefix = prefix_false
)
out_no_haem <- bind_rows(out)
# Synthesize
out_complete <-
out_all %>%
left_join(out_haem, by = "keys") %>%
left_join(out_no_haem, by = "keys") %>%
arrange(`All - P-value`)
# Print
out_complete %>%
sable()
| keys | All - Poor outcome (n=27) | All - Good outcome (n=208) | All - P-value | Haem - Poor outcome (n=16) | Haem - Good outcome (n=127) | Haem - P-value | No Haem - Poor outcome (n=11) | No Haem - Good outcome (n=81) | No Haem - P-value |
|---|---|---|---|---|---|---|---|---|---|
| mRS (final) | 0.6 (0.7) | 3.9 (1.1) | 0.000 | 0.7 (0.7) | 4.0 (1.0) | 0.000 | 0.6 (0.7) | 3.6 (1.2) | 0.000 |
| mRS (1-week post-op) | 1.4 (1.2) | 3.6 (1.1) | 0.000 | 1.7 (1.2) | 4.1 (0.8) | 0.000 | 1.0 (1.0) | 2.9 (1.1) | 0.000 |
| mRS (pre-treatment) | 1.5 (1.4) | 2.7 (1.6) | 0.000 | 1.9 (1.5) | 3.4 (1.5) | 0.001 | 0.8 (0.8) | 1.8 (1.3) | 0.008 |
| Diffuse nidus | 9 (33%) | 22 (11%) | 0.001 | 4 (25%) | 12 (9%) | 0.064 | 5 (45%) | 10 (12%) | 0.006 |
| Spetzler-Martin grade | 2.8 (1.2) | 3.6 (1.3) | 0.001 | 2.6 (1.1) | 3.3 (1.3) | 0.045 | 3.0 (1.2) | 4.1 (1.0) | 0.008 |
| Spetzler-Martin grade < 4 | 11 (42%) | 148 (73%) | 0.001 | 8 (53%) | 98 (80%) | 0.019 | 3 (27%) | 50 (62%) | 0.027 |
| Involves eloquent location | 23 (88%) | 118 (58%) | 0.003 | 12 (80%) | 65 (53%) | 0.046 | 11 (100%) | 53 (65%) | 0.020 |
| Lawton-Young grade | 4.3 (1.5) | 5.3 (1.8) | 0.005 | 3.8 (1.4) | 4.5 (1.6) | 0.078 | 5.0 (1.5) | 6.3 (1.5) | 0.013 |
| Size score | 1.6 (0.7) | 2.0 (0.8) | 0.013 | 1.5 (0.6) | 1.7 (0.8) | 0.209 | 1.9 (0.7) | 2.5 (0.7) | 0.018 |
| Nidus size (cm) | 3.3 (2.1) | 4.8 (3.1) | 0.015 | 2.8 (1.9) | 3.9 (3.2) | 0.300 | 3.9 (2.4) | 6.1 (2.6) | 0.005 |
| mRS (presentation) | 2.2 (1.8) | 3.0 (1.7) | 0.031 | 3.0 (1.7) | 3.8 (1.5) | 0.161 | 1.0 (0.9) | 1.9 (1.1) | 0.005 |
| Deficit at presentation | 15 (56%) | 72 (35%) | 0.034 | 6 (38%) | 47 (37%) | 0.969 | 9 (82%) | 25 (31%) | 0.001 |
| Age at 1st treatment (years) | 11.5 (3.7) | 9.7 (4.3) | 0.048 | 11.1 (3.4) | 9.4 (4.2) | 0.116 | 12.0 (4.2) | 10.3 (4.6) | 0.194 |
| Has deep venous drainage | 17 (65%) | 110 (54%) | 0.259 | 9 (60%) | 75 (60%) | 1.000 | 8 (73%) | 35 (44%) | 0.073 |
| Paresis at presentation | 9 (33%) | 51 (25%) | 0.324 | 6 (38%) | 36 (28%) | 0.450 | 3 (27%) | 15 (19%) | 0.495 |
| Presence of aneurysms | 6 (22%) | 40 (20%) | 0.769 | 4 (25%) | 30 (24%) | 0.944 | 2 (18%) | 10 (13%) | 0.628 |
| Sex (Male) | 15 (56%) | 109 (53%) | 0.777 | 7 (44%) | 69 (54%) | 0.426 | 8 (73%) | 40 (50%) | 0.159 |
| Hemorrhage at presentation | 16 (59%) | 127 (61%) | 0.857 | 16 (100%) | 127 (100%) | 0 (0%) | 0 (0%) | ||
| Seizures at presentation | 7 (26%) | 52 (25%) | 0.917 | 2 (12%) | 30 (24%) | 0.316 | 5 (45%) | 22 (27%) | 0.214 |
| Had surgery | 16 (59%) | 122 (60%) | 0.980 | 10 (62%) | 84 (68%) | 0.675 | 6 (55%) | 38 (47%) | 0.636 |
By diffuse nidus
# Define arguments
interaction_col <- "is_diffuse_nidus"
interaction_col_name <- "Is diffuse nidus"
prefix_true <- "Diffuse"
prefix_false <- "Not Diffuse"
# Get all
out <- list()
out[[1]] <- compute_descriptive_continuous(
use_interaction = F,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = "All"
)
out[[2]] <- compute_descriptive_binary(
use_interaction = F,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = "All"
)
out_all <- bind_rows(out)
# Get with diffuse nidus
out <- list()
out[[1]] <- compute_descriptive_continuous(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = prefix_true
)
out[[2]] <- compute_descriptive_binary(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = T,
prefix = prefix_true
)
out_haem <- bind_rows(out)
# Get without diffuse nidus
out <- list()
out[[1]] <- compute_descriptive_continuous(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = F,
prefix = prefix_false
)
out[[2]] <- compute_descriptive_binary(
use_interaction = T,
interaction_col = interaction_col,
interaction_col_name = interaction_col_name,
interaction_col_value = F,
prefix = prefix_false
)
out_no_haem <- bind_rows(out)
# Synthesize
out_complete <-
out_all %>%
left_join(out_haem, by = "keys") %>%
left_join(out_no_haem, by = "keys") %>%
arrange(`All - P-value`)
# Print
out_complete %>%
sable()
| keys | All - Poor outcome (n=27) | All - Good outcome (n=207) | All - P-value | Diffuse - Poor outcome (n=9) | Diffuse - Good outcome (n=22) | Diffuse - P-value | Not Diffuse - Poor outcome (n=18) | Not Diffuse - Good outcome (n=185) | Not Diffuse - P-value |
|---|---|---|---|---|---|---|---|---|---|
| mRS (final) | 0.6 (0.7) | 3.9 (1.1) | 0.000 | 1.0 (0.7) | 4.0 (1.2) | 0.000 | 0.6 (0.7) | 3.8 (1.1) | 0.000 |
| mRS (1-week post-op) | 1.4 (1.2) | 3.6 (1.1) | 0.000 | 1.3 (0.9) | 3.4 (1.1) | 0.000 | 1.4 (1.2) | 3.7 (1.1) | 0.000 |
| mRS (pre-treatment) | 1.5 (1.4) | 2.7 (1.6) | 0.000 | 1.3 (1.1) | 2.9 (1.5) | 0.012 | 1.5 (1.4) | 2.7 (1.6) | 0.002 |
| Is diffuse nidus | 9 (33%) | 22 (11%) | 0.001 | 9 (100%) | 22 (100%) | 0 (0%) | 0 (0%) | ||
| Spetzler-Martin grade | 2.8 (1.2) | 3.6 (1.3) | 0.001 | 4.3 (0.9) | 4.4 (0.7) | 0.702 | 2.6 (1.1) | 3.2 (1.3) | 0.055 |
| Spetzler-Martin grade < 4 | 11 (42%) | 147 (73%) | 0.001 | 1 (11%) | 4 (18%) | 0.633 | 10 (59%) | 143 (80%) | 0.045 |
| Involves eloquent location | 23 (88%) | 117 (58%) | 0.002 | 9 (100%) | 20 (91%) | 0.358 | 14 (82%) | 97 (54%) | 0.023 |
| Lawton-Young grade | 4.3 (1.5) | 5.3 (1.8) | 0.006 | 6.7 (1.2) | 6.8 (1.1) | 0.982 | 4.0 (1.3) | 4.5 (1.5) | 0.194 |
| Size score | 1.6 (0.7) | 2.0 (0.8) | 0.014 | 2.5 (0.7) | 2.6 (0.7) | 0.638 | 1.5 (0.6) | 1.8 (0.8) | 0.204 |
| Nidus size (cm) | 3.3 (2.1) | 4.8 (3.1) | 0.015 | 5.9 (3.0) | 7.5 (3.1) | 0.076 | 2.9 (1.8) | 3.4 (2.0) | 0.364 |
| mRS (presentation) | 2.2 (1.8) | 3.0 (1.7) | 0.031 | 1.6 (1.4) | 3.0 (1.6) | 0.037 | 2.3 (1.8) | 3.1 (1.7) | 0.091 |
| Deficit at presentation | 15 (56%) | 71 (34%) | 0.032 | 8 (89%) | 10 (45%) | 0.029 | 7 (39%) | 61 (33%) | 0.613 |
| Age at 1st treatment (years) | 11.5 (3.7) | 9.7 (4.3) | 0.045 | 11.4 (3.7) | 8.4 (3.1) | 0.053 | 11.5 (3.7) | 10.4 (4.7) | 0.446 |
| Has deep venous drainage | 17 (65%) | 110 (54%) | 0.269 | 7 (78%) | 20 (91%) | 0.330 | 10 (59%) | 90 (49%) | 0.461 |
| Paresis at presentation | 9 (33%) | 50 (24%) | 0.303 | 5 (56%) | 7 (32%) | 0.226 | 4 (22%) | 43 (23%) | 0.922 |
| Presence of aneurysms | 6 (22%) | 39 (19%) | 0.730 | 2 (22%) | 8 (36%) | 0.452 | 4 (22%) | 31 (17%) | 0.605 |
| Sex (Male) | 15 (56%) | 109 (53%) | 0.796 | 6 (67%) | 10 (45%) | 0.291 | 9 (50%) | 99 (54%) | 0.758 |
| Hemorrhage at presentation | 16 (59%) | 127 (61%) | 0.834 | 4 (44%) | 12 (55%) | 0.615 | 12 (67%) | 115 (62%) | 0.707 |
| Seizures at presentation | 7 (26%) | 52 (25%) | 0.928 | 4 (44%) | 7 (32%) | 0.512 | 3 (17%) | 45 (24%) | 0.467 |
| Had surgery | 16 (59%) | 122 (60%) | 0.957 | 4 (44%) | 11 (50%) | 0.782 | 12 (67%) | 111 (61%) | 0.638 |
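The p-values in the tables above come from rank-based tests (Wilcoxon for the continuous summaries, Kruskal-Wallis for the binary ones). In miniature, with made-up group values:

```r
# Rank tests behind the descriptive-table p-values (toy data)
x <- c(5, 7, 8, 9, 12)    # e.g. a continuous measure in one outcome group
y <- c(2, 3, 4, 5.5, 6)   # the same measure in the other group
wilcox.test(x, y)$p.value          # used for continuous predictors
kruskal.test(list(x, y))$p.value   # used for binary predictors (2 groups)
```

With two groups, the Kruskal-Wallis test is equivalent to a Wilcoxon rank-sum test up to the continuity/approximation details, which is why the two functions above can share the same interpretation.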
Associations
Correlation matrix.
# Create new dataframe
df_ <-
df_uni %>%
select(all_of(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL,
OUTCOMES
)))
# Convert the outcome variable to numeric for correlation calculation
df_ <-
df_ %>%
select(where(is.numeric), where(is.logical)) %>%
mutate(across(everything(), as.numeric))
# Compute the correlation matrix
cor_matrix <- cor(df_, use = "complete.obs", method = "spearman")
# Plot the correlation matrix
ggcorrplot::ggcorrplot(
cor_matrix,
method = "circle",
lab = TRUE,
lab_size = 2,
colors = c("red", "white", "green4"),
title = "Correlation Matrix",
hc.order = TRUE,
) +
theme(
axis.text.x = element_text(size = 8), # Adjust x-axis text size
axis.text.y = element_text(size = 8) # Adjust y-axis text size
)
Univariable statistics
Setup
Define function.
fit_model <- function(
df = df_uni, cols, is_sandwich = TRUE) {
# Initialize
out <- list()
for (i in seq_along(cols)) {
# Create formula
model <- as.formula(paste(CHOSEN_OUTCOME, "~", cols[i]))
fit <- suppressWarnings(glm(
model,
data = df,
family = binomial()
))
if (is_sandwich) {
# Calculate robust standard errors (sandwich)
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
fit_results <- lmtest::coeftest(fit, vcov. = robust_se)
} else {
fit_results <- fit
}
# Summarize model coefficients
fit_summary <-
fit_results %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
out[[i]] <- stylized
}
return(out)
}
Unadjusted - Original
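A stripped-down base-R sketch of what fit_model does per predictor: one logistic fit per column, exponentiated coefficient, p-value. Simulated data and names here are illustrative, and robust (sandwich) standard errors are omitted.

```r
set.seed(2)
n <- 150
dat <- data.frame(
  size    = rnorm(n, 3, 1),
  is_male = rbinom(n, 1, 0.5) == 1
)
# Outcome depends on size only
dat$poor <- rbinom(n, 1, plogis(-2 + 0.8 * dat$size))

# One logistic fit per predictor, collecting ORs and p-values
rows <- lapply(c("size", "is_male"), function(v) {
  fit <- glm(reformulate(v, "poor"), data = dat, family = binomial())
  s <- summary(fit)$coefficients[2, ]
  data.frame(predictor = v, odds_ratio = exp(s["Estimate"]), p = s["Pr(>|z|)"])
})
res <- do.call(rbind, rows)
res
```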
Fit logistic.
# Define cols
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL
))
# Fit model
out <- fit_model(df = df_uni, cols = cols, is_sandwich = T)
# Create table
univariable_unadjusted <-
out %>%
bind_rows() %>%
filter(Predictors != "(Intercept)") %>%
filter(`CI (high)` < 50) %>%
arrange(`P-values`)
# Print
univariable_unadjusted %>%
sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.66 | 0.12 | 4.192 | 0.000 | 1.31 | 2.10 |
| modified_rankin_score_postop_within_1_week | 3.42 | 0.19 | 6.585 | 0.000 | 2.37 | 4.94 |
| locationCorpus Callosum | 0.00 | 0.65 | -24.883 | 0.000 | 0.00 | 0.00 |
| is_diffuse_nidusTRUE | 4.20 | 0.47 | 3.079 | 0.002 | 1.69 | 10.49 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.27 | 0.43 | -3.083 | 0.002 | 0.12 | 0.62 |
| spetzler_martin_grade | 1.83 | 0.20 | 2.971 | 0.003 | 1.23 | 2.72 |
| lawton_young_grade | 1.47 | 0.14 | 2.762 | 0.006 | 1.12 | 1.92 |
| is_eloquent_locationTRUE | 5.59 | 0.63 | 2.731 | 0.006 | 1.63 | 19.21 |
| max_size_cm | 1.26 | 0.09 | 2.609 | 0.009 | 1.06 | 1.49 |
| size_score | 2.05 | 0.29 | 2.496 | 0.013 | 1.17 | 3.61 |
| modified_rankin_score_presentation | 1.28 | 0.10 | 2.399 | 0.016 | 1.05 | 1.56 |
| age_at_first_treatment_yrs | 0.89 | 0.05 | -2.071 | 0.038 | 0.80 | 0.99 |
| has_deficitTRUE | 2.36 | 0.41 | 2.076 | 0.038 | 1.05 | 5.31 |
| locationHemispheric | 0.32 | 0.65 | -1.742 | 0.082 | 0.09 | 1.15 |
| has_deep_venous_drainageTRUE | 1.63 | 0.44 | 1.124 | 0.261 | 0.69 | 3.83 |
| venous_drainageSuperficial | 0.54 | 0.56 | -1.091 | 0.275 | 0.18 | 1.63 |
| procedure_combinationsES | 0.24 | 1.33 | -1.085 | 0.278 | 0.02 | 3.21 |
| has_paresisTRUE | 1.54 | 0.44 | 0.983 | 0.326 | 0.65 | 3.64 |
| procedure_combinationsR | 0.40 | 1.23 | -0.742 | 0.458 | 0.04 | 4.50 |
| locationOther | 2.25 | 1.34 | 0.604 | 0.546 | 0.16 | 31.33 |
| procedure_combinationsS | 0.49 | 1.20 | -0.595 | 0.552 | 0.05 | 5.13 |
| procedure_combinationsRS | 0.47 | 1.34 | -0.560 | 0.575 | 0.03 | 6.57 |
| procedure_combinationsER | 0.62 | 1.20 | -0.404 | 0.686 | 0.06 | 6.48 |
| has_associated_aneurysmTRUE | 1.16 | 0.50 | 0.295 | 0.768 | 0.44 | 3.06 |
| is_maleTRUE | 1.12 | 0.41 | 0.284 | 0.777 | 0.50 | 2.52 |
| venous_drainageDeep | 0.87 | 0.55 | -0.260 | 0.795 | 0.30 | 2.54 |
| locationDeep | 1.13 | 0.63 | 0.186 | 0.853 | 0.32 | 3.90 |
| has_hemorrhageTRUE | 0.93 | 0.42 | -0.180 | 0.857 | 0.41 | 2.10 |
| has_seizuresTRUE | 1.05 | 0.47 | 0.104 | 0.917 | 0.42 | 2.62 |
| procedure_combinationsERS | 1.09 | 1.21 | 0.072 | 0.943 | 0.10 | 11.67 |
| is_surgeryTRUE | 0.99 | 0.42 | -0.025 | 0.980 | 0.44 | 2.24 |
Adjusted - Original
Fit logistic.
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL
))
# Initialize
out <- list()
df <- df_uni
for (i in seq_along(cols)) {
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
"age_at_first_treatment_yrs + is_male +",
cols[i]
))
fit <- suppressWarnings(glm(
model,
data = df,
family = binomial()
))
# Calculate robust standard errors (sandwich)
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
out[[i]] <- stylized
}
Print results.
# Create table
univariable_adjusted <-
out %>%
bind_rows() %>%
filter(Predictors != "(Intercept)") %>%
filter(!str_starts(Predictors, "age_at_first_treatment_yrs|is_male")) %>%
filter(`CI (high)` < 50) %>%
arrange(`P-values`)
# Print
univariable_adjusted %>%
sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.63 | 0.12 | 3.965 | 0.000 | 1.28 | 2.07 |
| modified_rankin_score_postop_within_1_week | 3.76 | 0.25 | 5.341 | 0.000 | 2.31 | 6.11 |
| locationCorpus Callosum | 0.00 | 0.63 | -25.713 | 0.000 | 0.00 | 0.00 |
| spetzler_martin_grade | 1.85 | 0.19 | 3.195 | 0.001 | 1.27 | 2.70 |
| lawton_young_grade | 1.47 | 0.13 | 2.945 | 0.003 | 1.14 | 1.89 |
| is_eloquent_locationTRUE | 5.95 | 0.60 | 2.953 | 0.003 | 1.82 | 19.42 |
| is_diffuse_nidusTRUE | 3.97 | 0.46 | 2.989 | 0.003 | 1.61 | 9.80 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.28 | 0.44 | -2.910 | 0.004 | 0.12 | 0.66 |
| size_score | 2.13 | 0.29 | 2.641 | 0.008 | 1.21 | 3.72 |
| max_size_cm | 1.24 | 0.09 | 2.432 | 0.015 | 1.04 | 1.48 |
| modified_rankin_score_presentation | 1.23 | 0.10 | 2.042 | 0.041 | 1.01 | 1.51 |
| has_deficitTRUE | 2.16 | 0.42 | 1.842 | 0.065 | 0.95 | 4.91 |
| locationHemispheric | 0.34 | 0.63 | -1.697 | 0.090 | 0.10 | 1.18 |
| has_deep_venous_drainageTRUE | 1.80 | 0.45 | 1.311 | 0.190 | 0.75 | 4.32 |
| venous_drainageSuperficial | 0.49 | 0.57 | -1.232 | 0.218 | 0.16 | 1.52 |
| procedure_combinationsES | 0.23 | 1.44 | -1.031 | 0.302 | 0.01 | 3.80 |
| procedure_combinationsR | 0.37 | 1.34 | -0.732 | 0.464 | 0.03 | 5.20 |
| procedure_combinationsS | 0.42 | 1.30 | -0.668 | 0.504 | 0.03 | 5.38 |
| procedure_combinationsRS | 0.40 | 1.39 | -0.656 | 0.512 | 0.03 | 6.10 |
| has_paresisTRUE | 1.31 | 0.46 | 0.590 | 0.556 | 0.53 | 3.25 |
| has_associated_aneurysmTRUE | 1.23 | 0.50 | 0.420 | 0.675 | 0.46 | 3.30 |
| has_hemorrhageTRUE | 0.85 | 0.42 | -0.373 | 0.709 | 0.37 | 1.96 |
| procedure_combinationsER | 0.66 | 1.33 | -0.314 | 0.754 | 0.05 | 8.90 |
| locationDeep | 1.19 | 0.61 | 0.288 | 0.774 | 0.36 | 3.97 |
| venous_drainageDeep | 0.88 | 0.54 | -0.243 | 0.808 | 0.30 | 2.55 |
| is_surgeryTRUE | 0.92 | 0.42 | -0.208 | 0.836 | 0.40 | 2.09 |
| procedure_combinationsERS | 1.12 | 1.33 | 0.083 | 0.934 | 0.08 | 15.12 |
| has_seizuresTRUE | 1.03 | 0.47 | 0.058 | 0.954 | 0.41 | 2.59 |
Unadjusted - Recoded
Fit logistic.
# Define columns
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL
))
# Initialize
out <- list()
df <- df_multi
for (i in seq_along(cols)) {
# Create formula
model <- as.formula(paste(CHOSEN_OUTCOME, "~", cols[i]))
fit <- suppressWarnings(glm(
model,
data = df,
family = binomial()
))
# Calculate robust standard errors (sandwich)
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
out[[i]] <- stylized
}
Print results.
# Create table
univariable_unadjusted_recoded <-
out %>%
bind_rows() %>%
filter(Predictors != "(Intercept)") %>%
filter(`CI (high)` < 50) %>%
arrange(`P-values`)
# Print
univariable_unadjusted_recoded %>%
sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.66 | 0.12 | 4.192 | 0.000 | 1.31 | 2.10 |
| modified_rankin_score_postop_within_1_week | 3.42 | 0.19 | 6.585 | 0.000 | 2.37 | 4.94 |
| is_diffuse_nidusTRUE | 4.20 | 0.47 | 3.079 | 0.002 | 1.69 | 10.49 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.27 | 0.43 | -3.083 | 0.002 | 0.12 | 0.62 |
| spetzler_martin_grade | 1.83 | 0.20 | 2.971 | 0.003 | 1.23 | 2.72 |
| lawton_young_grade | 1.47 | 0.14 | 2.762 | 0.006 | 1.12 | 1.92 |
| is_eloquent_locationTRUE | 5.59 | 0.63 | 2.731 | 0.006 | 1.63 | 19.21 |
| max_size_cm | 1.26 | 0.09 | 2.609 | 0.009 | 1.06 | 1.49 |
| size_score | 2.05 | 0.29 | 2.496 | 0.013 | 1.17 | 3.61 |
| modified_rankin_score_presentation | 1.28 | 0.10 | 2.399 | 0.016 | 1.05 | 1.56 |
| age_at_first_treatment_yrs | 0.89 | 0.05 | -2.071 | 0.038 | 0.80 | 0.99 |
| has_deficitTRUE | 2.36 | 0.41 | 2.076 | 0.038 | 1.05 | 5.31 |
| locationHemispheric | 0.41 | 0.59 | -1.482 | 0.138 | 0.13 | 1.33 |
| venous_drainageSuperficial | 0.59 | 0.44 | -1.193 | 0.233 | 0.25 | 1.40 |
| has_deep_venous_drainageTRUE | 1.63 | 0.44 | 1.124 | 0.261 | 0.69 | 3.83 |
| has_paresisTRUE | 1.54 | 0.44 | 0.983 | 0.326 | 0.65 | 3.64 |
| procedure_combinationsR | 0.40 | 1.23 | -0.742 | 0.458 | 0.04 | 4.50 |
| locationDeep | 1.45 | 0.58 | 0.646 | 0.518 | 0.47 | 4.48 |
| procedure_combinationsS | 0.49 | 1.20 | -0.595 | 0.552 | 0.05 | 5.13 |
| procedure_combinationsMultimodal | 0.57 | 1.15 | -0.487 | 0.626 | 0.06 | 5.44 |
| has_associated_aneurysmTRUE | 1.16 | 0.50 | 0.295 | 0.768 | 0.44 | 3.06 |
| is_maleTRUE | 1.12 | 0.41 | 0.284 | 0.777 | 0.50 | 2.52 |
| is_surgeryTRUE | 0.91 | 0.49 | -0.193 | 0.847 | 0.35 | 2.38 |
| has_hemorrhageTRUE | 0.93 | 0.42 | -0.180 | 0.857 | 0.41 | 2.10 |
| has_seizuresTRUE | 1.05 | 0.47 | 0.104 | 0.917 | 0.42 | 2.62 |
Adjusted - Recoded
Fit logistic.
# Define columns
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL
))
# Initialize
out <- list()
df <- df_multi
for (i in seq_along(cols)) {
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
"age_at_first_treatment_yrs + is_male +",
cols[i]
))
fit <- suppressWarnings(glm(
model,
data = df,
family = binomial()
))
# Calculate robust standard errors (sandwich)
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
out[[i]] <- stylized
}
Print results.
# Create table
univariable_adjusted_recoded <-
out %>%
bind_rows() %>%
filter(Predictors != "(Intercept)") %>%
filter(!str_starts(Predictors, "age_at_first_treatment_yrs|is_male")) %>%
filter(`CI (high)` < 50) %>%
arrange(`P-values`)
# Print
univariable_adjusted_recoded %>%
sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.63 | 0.12 | 3.965 | 0.000 | 1.28 | 2.07 |
| modified_rankin_score_postop_within_1_week | 3.76 | 0.25 | 5.341 | 0.000 | 2.31 | 6.11 |
| spetzler_martin_grade | 1.85 | 0.19 | 3.195 | 0.001 | 1.27 | 2.70 |
| lawton_young_grade | 1.47 | 0.13 | 2.945 | 0.003 | 1.14 | 1.89 |
| is_eloquent_locationTRUE | 5.95 | 0.60 | 2.953 | 0.003 | 1.82 | 19.42 |
| is_diffuse_nidusTRUE | 3.97 | 0.46 | 2.989 | 0.003 | 1.61 | 9.80 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.28 | 0.44 | -2.910 | 0.004 | 0.12 | 0.66 |
| size_score | 2.13 | 0.29 | 2.641 | 0.008 | 1.21 | 3.72 |
| max_size_cm | 1.24 | 0.09 | 2.432 | 0.015 | 1.04 | 1.48 |
| modified_rankin_score_presentation | 1.23 | 0.10 | 2.042 | 0.041 | 1.01 | 1.51 |
| has_deficitTRUE | 2.16 | 0.42 | 1.842 | 0.065 | 0.95 | 4.91 |
| locationHemispheric | 0.42 | 0.58 | -1.471 | 0.141 | 0.13 | 1.33 |
| venous_drainageSuperficial | 0.54 | 0.45 | -1.375 | 0.169 | 0.22 | 1.30 |
| has_deep_venous_drainageTRUE | 1.80 | 0.45 | 1.311 | 0.190 | 0.75 | 4.32 |
| procedure_combinationsR | 0.38 | 1.34 | -0.728 | 0.466 | 0.03 | 5.18 |
| locationDeep | 1.46 | 0.57 | 0.666 | 0.505 | 0.48 | 4.48 |
| procedure_combinationsS | 0.43 | 1.30 | -0.660 | 0.510 | 0.03 | 5.39 |
| has_paresisTRUE | 1.31 | 0.46 | 0.590 | 0.556 | 0.53 | 3.25 |
| is_surgeryTRUE | 0.80 | 0.50 | -0.446 | 0.656 | 0.30 | 2.14 |
| procedure_combinationsMultimodal | 0.58 | 1.26 | -0.437 | 0.662 | 0.05 | 6.85 |
| has_associated_aneurysmTRUE | 1.23 | 0.50 | 0.420 | 0.675 | 0.46 | 3.30 |
| has_hemorrhageTRUE | 0.85 | 0.42 | -0.373 | 0.709 | 0.37 | 1.96 |
| has_seizuresTRUE | 1.03 | 0.47 | 0.058 | 0.954 | 0.41 | 2.59 |
Interaction analysis
Adjusted - Recoded
Fit logistic.
# Define columns
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL
))
# Remove unwanted columns
unwanted <- c("modified_rankin_score_final")
cols <- cols[!cols %in% unwanted]
adjustors <- c(
"age_at_first_treatment_yrs",
"is_male"
)
# Initialize
out <- list()
df <- df_multi
k <- 1
for (i in seq_along(cols)) {
# Do not fit model for variables that we are adjusting for
if (cols[i] %in% adjustors) {
next
}
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
"age_at_first_treatment_yrs + is_male +",
cols[i],
"* has_hemorrhage"
))
fit <- suppressWarnings(glm(
model,
data = df,
family = binomial()
))
# Calculate robust standard errors (sandwich)
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
out[[k]] <- stylized
k <- k + 1
}
Print all results.
# Define values
unwanted <- c(
"is_maleTRUE",
"age_at_first_treatment_yrs",
"(Intercept)",
"has_hemorrhageTRUE"
)
# Initialize objects
isolated_values <- list()
# Get main and interaction effects
for (i in seq_along(out)) {
# Get data
result <- out[[i]]
# Get all main effects of interest
main_effects <-
result %>%
filter(!Predictors %in% unwanted) %>%
filter(!str_detect(Predictors, ":")) %>%
select(-"SE", -"Z-scores")
# Get the interaction effects
interaction_effects <-
result %>%
filter(!Predictors %in% unwanted) %>%
filter(str_detect(Predictors, ":")) %>%
mutate(Predictors = str_extract(Predictors, "^[^:]+")) %>%
select(-"SE", -"Z-scores") %>%
rename_with(~ paste("Interaction -", .), -Predictors)
isolated_values[[i]] <-
main_effects %>%
left_join(interaction_effects, by = "Predictors")
}
Create and print table.
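The isolation loop above keys off the `:` that `glm()` places in interaction term names. A minimal base-R sketch of the same parsing, on a hypothetical term name, equivalent to the `str_detect`/`str_extract` calls:

```r
# glm() names an interaction coefficient "<termA>:<termB>"; rows whose
# name contains ":" are interaction effects, and the text before the
# first ":" identifies the main-effect predictor they belong to.
term <- "has_deficitTRUE:has_hemorrhageTRUE"  # hypothetical term name
is_interaction <- grepl(":", term, fixed = TRUE)  # TRUE for this term
main_part <- sub(":.*$", "", term)  # same result as str_extract(term, "^[^:]+")
main_part
```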
# Create table
univariable_adjusted_recoded_interactions <-
isolated_values %>%
bind_rows() %>%
arrange(`Interaction - P-values`)
# Print
univariable_adjusted_recoded_interactions %>%
sable()| Predictors | Odds Ratios (OR) | P-values | CI (low) | CI (high) | Interaction - Odds Ratios (OR) | Interaction - P-values | Interaction - CI (low) | Interaction - CI (high) |
|---|---|---|---|---|---|---|---|---|
| is_eloquent_locationTRUE | 2.10e+07 | 0.000 | 8.40e+06 | 5.26e+07 | 0.00 | 0.000 | 0.00 | 0.00 |
| procedure_combinationsMultimodal | 8.50e+05 | 0.000 | 9.97e+04 | 7.25e+06 | 0.00 | 0.000 | 0.00 | 0.00 |
| procedure_combinationsR | 8.40e+05 | 0.000 | 8.09e+04 | 8.71e+06 | 0.00 | 0.000 | 0.00 | 0.00 |
| procedure_combinationsS | 8.35e+05 | 0.000 | 3.97e+04 | 1.75e+07 | 0.00 | 0.000 | 0.00 | 0.00 |
| has_deficitTRUE | 7.74e+00 | 0.016 | 1.48e+00 | 4.06e+01 | 0.14 | 0.057 | 0.02 | 1.06 |
| has_seizuresTRUE | 2.29e+00 | 0.216 | 6.20e-01 | 8.53e+00 | 0.19 | 0.104 | 0.03 | 1.41 |
| modified_rankin_score_presentation | 2.07e+00 | 0.009 | 1.20e+00 | 3.57e+00 | 0.62 | 0.142 | 0.33 | 1.17 |
| venous_drainageSuperficial | 2.60e-01 | 0.061 | 6.00e-02 | 1.06e+00 | 3.26 | 0.183 | 0.57 | 18.59 |
| has_deep_venous_drainageTRUE | 3.71e+00 | 0.069 | 9.00e-01 | 1.52e+01 | 0.31 | 0.186 | 0.05 | 1.77 |
| modified_rankin_score_pretreatment | 2.59e+00 | 0.004 | 1.36e+00 | 4.94e+00 | 0.67 | 0.277 | 0.33 | 1.38 |
| lawton_young_grade | 1.72e+00 | 0.022 | 1.08e+00 | 2.73e+00 | 0.83 | 0.545 | 0.46 | 1.50 |
| size_score | 2.74e+00 | 0.036 | 1.07e+00 | 7.00e+00 | 0.71 | 0.589 | 0.20 | 2.47 |
| locationDeep | 8.90e-01 | 0.925 | 8.00e-02 | 1.03e+01 | 1.92 | 0.641 | 0.12 | 30.23 |
| modified_rankin_score_postop_within_1_week | 4.17e+00 | 0.001 | 1.84e+00 | 9.44e+00 | 1.23 | 0.649 | 0.50 | 3.03 |
| has_paresisTRUE | 1.01e+00 | 0.994 | 1.80e-01 | 5.61e+00 | 1.52 | 0.689 | 0.19 | 11.97 |
| is_diffuse_nidusTRUE | 4.82e+00 | 0.022 | 1.25e+00 | 1.85e+01 | 0.69 | 0.695 | 0.11 | 4.41 |
| spetzler_martin_grade | 2.10e+00 | 0.020 | 1.12e+00 | 3.93e+00 | 0.85 | 0.700 | 0.38 | 1.90 |
| has_associated_aneurysmTRUE | 1.49e+00 | 0.645 | 2.70e-01 | 8.12e+00 | 0.79 | 0.820 | 0.10 | 6.31 |
| is_surgeryTRUE | 1.00e+00 | 0.998 | 9.00e-02 | 1.07e+01 | 0.78 | 0.855 | 0.06 | 10.72 |
| locationHemispheric | 3.30e-01 | 0.379 | 3.00e-02 | 3.85e+00 | 1.23 | 0.886 | 0.08 | 19.69 |
| is_spetzler_martin_grade_less_than_4TRUE | 2.90e-01 | 0.085 | 7.00e-02 | 1.19e+00 | 0.88 | 0.893 | 0.14 | 5.52 |
| max_size_cm | 1.27e+00 | 0.111 | 9.50e-01 | 1.69e+00 | 0.98 | 0.911 | 0.67 | 1.43 |
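The astronomical odds ratios in the first rows of the interaction table (e.g. 2.10e+07 for eloquence) are a symptom of complete separation: as the eloquence count table shows, no non-hemorrhage, non-eloquent patient had a poor outcome, so the likelihood has no finite maximum. A minimal sketch on toy data of how a zero-event cell drives a `glm()` coefficient toward infinity:

```r
# Toy illustration of complete separation: when one group has no
# events at all, the logistic MLE does not exist and glm() returns a
# numerically huge coefficient (with a convergence warning).
set.seed(7)
x <- rep(c(0, 1), each = 20)
y <- c(rep(0, 20), rbinom(20, 1, 0.5))  # zero events when x == 0
fit <- suppressWarnings(glm(y ~ x, family = binomial()))
coef(fit)["x"]  # very large positive value
```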
Eloquence.
df %>%
count(has_hemorrhage, is_eloquent_location, !!sym(CHOSEN_OUTCOME)) %>%
drop_na() %>%
arrange(has_hemorrhage, is_eloquent_location) %>%
group_by(has_hemorrhage, is_eloquent_location) %>%
mutate(
num_total = sum(n),
pct_total = scales::percent(n / num_total, 1)
) %>%
sable()
| has_hemorrhage | is_eloquent_location | is_poor_outcome | n | num_total | pct_total |
|---|---|---|---|---|---|
| FALSE | FALSE | FALSE | 28 | 28 | 100% |
| FALSE | TRUE | FALSE | 53 | 64 | 83% |
| FALSE | TRUE | TRUE | 11 | 64 | 17% |
| TRUE | FALSE | FALSE | 58 | 61 | 95% |
| TRUE | FALSE | TRUE | 3 | 61 | 5% |
| TRUE | TRUE | FALSE | 65 | 77 | 84% |
| TRUE | TRUE | TRUE | 12 | 77 | 16% |
Deficit.
df %>%
count(has_hemorrhage, has_deficit, !!sym(CHOSEN_OUTCOME)) %>%
drop_na() %>%
arrange(has_hemorrhage, has_deficit) %>%
group_by(has_hemorrhage, has_deficit) %>%
mutate(
num_total = sum(n),
pct_total = scales::percent(n / num_total, 1)
) %>%
sable()
| has_hemorrhage | has_deficit | is_poor_outcome | n | num_total | pct_total |
|---|---|---|---|---|---|
| FALSE | FALSE | FALSE | 56 | 58 | 97% |
| FALSE | FALSE | TRUE | 2 | 58 | 3% |
| FALSE | TRUE | FALSE | 25 | 34 | 74% |
| FALSE | TRUE | TRUE | 9 | 34 | 26% |
| TRUE | FALSE | FALSE | 80 | 90 | 89% |
| TRUE | FALSE | TRUE | 10 | 90 | 11% |
| TRUE | TRUE | FALSE | 47 | 53 | 89% |
| TRUE | TRUE | TRUE | 6 | 53 | 11% |
Seizures.
df %>%
count(has_hemorrhage, has_seizures, !!sym(CHOSEN_OUTCOME)) %>%
drop_na() %>%
arrange(has_hemorrhage, has_seizures) %>%
group_by(has_hemorrhage, has_seizures) %>%
mutate(
num_total = sum(n),
pct_total = scales::percent(n / num_total, 1)
) %>%
sable()
| has_hemorrhage | has_seizures | is_poor_outcome | n | num_total | pct_total |
|---|---|---|---|---|---|
| FALSE | FALSE | FALSE | 59 | 65 | 91% |
| FALSE | FALSE | TRUE | 6 | 65 | 9% |
| FALSE | TRUE | FALSE | 22 | 27 | 81% |
| FALSE | TRUE | TRUE | 5 | 27 | 19% |
| TRUE | FALSE | FALSE | 97 | 111 | 87% |
| TRUE | FALSE | TRUE | 14 | 111 | 13% |
| TRUE | TRUE | FALSE | 30 | 32 | 94% |
| TRUE | TRUE | TRUE | 2 | 32 | 6% |
Selective inference
Read the following guides:
- How to use glmnet
- Selective inference using forward selection
Setup
Define unwanted columns.
unwanted_all <- c(
# Use values at or before first treatment
"modified_rankin_score_postop_within_1_week",
"modified_rankin_score_final"
)
unwanted_without_gradings <- c(
unwanted_all,
# Spetzler-Martin grade (size score + eloquent location + venous drainage)
"spetzler_martin_grade",
"is_spetzler_martin_grade_less_than_4",
"size_score", # Already covered by the more detailed max_size_cm
"venous_drainage", # Already covered by has_deep_venous_drainage
"location", # Already covered to some extent by eloquence
# Lawton-Young grade (SM + age + diffuse nidus + hemorrhage)
# (SM is heavily overlapping with LY, so include its components)
"lawton_young_grade",
# Use mRS pre-treatment (more predictive)
"modified_rankin_score_presentation" # "modified_rankin_score_pretreatment"
)
# Define cols of interest - use grading scores whenever these exist
unwanted_with_gradings <- c(
# Spetzler-Martin grade (size score + eloquent location + venous drainage)
"is_spetzler_martin_grade_less_than_4", # The grade has lower p-value
"size_score",
"max_size_cm",
"is_eloquent_location",
# "location", # Include this as not highly correlated with SM
"has_deep_venous_drainage",
"venous_drainage",
# Lawton-Young grade (SM + age + diffuse nidus + hemorrhage)
# (SM is heavily overlapping with LY, so include its components)
"lawton_young_grade",
# "is_diffuse_nidus",
# "has_hemorrhage",
# Use values at or before first treatment
"modified_rankin_score_final",
"modified_rankin_score_postop_within_1_week",
# Use mRS pre-treatment (more predictive)
"modified_rankin_score_presentation"
# "modified_rankin_score_pretreatment"
)
Create dataset.
# Define cols
cols <- unname(c(
EXPOSURES_CONTINUOUS,
EXPOSURES_BINARY,
EXPOSURES_CATEGORICAL,
CHOSEN_OUTCOME
))
# Define column sets for each analysis
cols_all <- cols[!cols %in% unwanted_all]
cols_with_gradings <- cols[!cols %in% unwanted_with_gradings]
cols_without_gradings <- cols[!cols %in% unwanted_without_gradings]
# Create df of interest
df_all <-
# Use the non-recoded dataset
df_uni %>%
dplyr::select(all_of(cols_all)) %>%
drop_na()
df_with_gradings <-
# Use the non-recoded dataset
df_uni %>%
dplyr::select(all_of(cols_with_gradings)) %>%
drop_na()
df_no_scores <-
# Use the non-recoded dataset
df_uni %>%
dplyr::select(all_of(cols_without_gradings)) %>%
drop_na()
# Create formula
frla <- as.formula(paste(CHOSEN_OUTCOME, " ~ . - 1"))
# Create matrices with all variables
X_all <- model.matrix(frla, df_all)
y_all <- df_all %>% pull(CHOSEN_OUTCOME) %>% as.numeric()
# Create matrices with scores
X_with_gradings <- model.matrix(frla, df_with_gradings)
y_with_gradings <- df_with_gradings %>% pull(CHOSEN_OUTCOME) %>% as.numeric()
# Create matrices with score components
X_without_gradings <- model.matrix(frla, df_no_scores)
y_without_gradings <- df_no_scores %>% pull(CHOSEN_OUTCOME) %>% as.numeric()
# X should be centered for LASSO (necessary for glmnet)
# (scale = FALSE as this is done by the standardization step in glmnet)
X_all_scaled <- scale(X_all, center = T, scale = F)
X_with_gradings_scaled <- scale(X_with_gradings, center = T, scale = F)
X_without_gradings_scaled <- scale(X_without_gradings, center = T, scale = F)
# See the names of X to match column numbers to column names later on
colnames(X_without_gradings_scaled)
 [1] "age_at_first_treatment_yrs" "modified_rankin_score_pretreatment"
[3] "max_size_cm" "is_maleFALSE"
[5] "is_maleTRUE" "is_eloquent_locationTRUE"
[7] "has_deep_venous_drainageTRUE" "is_diffuse_nidusTRUE"
[9] "has_hemorrhageTRUE" "has_seizuresTRUE"
[11] "has_deficitTRUE" "has_paresisTRUE"
[13] "has_associated_aneurysmTRUE" "is_surgeryTRUE"
[15] "procedure_combinationsER" "procedure_combinationsERS"
[17] "procedure_combinationsES" "procedure_combinationsR"
[19] "procedure_combinationsRS" "procedure_combinationsS"
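Note that both `is_maleFALSE` and `is_maleTRUE` survive into the design matrix: with the intercept removed (`~ . - 1`), `model.matrix()` keeps every level of the first factor instead of dropping a reference level, so the two dummy columns sum to 1 in every row and are perfectly collinear with an intercept, which contributes to the numerically degenerate coefficients (e.g. -6.95e+13 for variable 5) in the stepwise output below. A minimal sketch on toy data:

```r
# Toy illustration: removing the intercept ("- 1") makes model.matrix()
# keep BOTH levels of the first factor, so the resulting pair of dummy
# columns sums to 1 in every row.
toy <- data.frame(
  y = c(0, 1, 0, 1),
  is_male = factor(c(TRUE, FALSE, TRUE, FALSE))
)
X <- model.matrix(y ~ . - 1, toy)
colnames(X)  # "is_maleFALSE" "is_maleTRUE"
rowSums(X)   # all 1 -> perfectly collinear with an intercept column
```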
Stepwise
Run forward step-wise (linear) selection with post-selection inference.
# Set seed
set.seed(33)
# Run forward step-wise
fsfit <- fs(
x = X_without_gradings_scaled,
y = y_without_gradings,
maxsteps = 2000,
intercept = TRUE,
normalize = FALSE
)
# Estimate sigma
sigmahat <- estimateSigma(
x = X_without_gradings_scaled,
y = y_without_gradings,
intercept = TRUE,
standardize = FALSE
)$sigmahat
# Compute sequential p-values and confidence intervals - for all
# (sigma estimated from full model)
out.seq <- fsInf(fsfit, type = "active", sigma = sigmahat)
out.seq
Call:
fsInf(obj = fsfit, sigma = sigmahat, type = "active")
Standard deviation of noise (specified or estimated) sigma = 0.304
Sequential testing results with alpha = 0.100
Step Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
1 2 5.90e-02 4.153 0.047 1.00e-03 8.10e-02 0.05 0.049
2 3 3.00e-02 3.371 0.386 -7.00e-02 5.60e-02 0.05 0.049
3 6 8.30e-02 1.855 0.856 -2.49e+00 1.82e-01 0.05 0.050
4 1 -1.00e-02 -1.872 0.107 -3.09e-01 2.00e-02 0.05 0.050
5 8 9.80e-02 1.430 0.308 -1.26e+00 2.98e+00 0.05 0.050
6 12 -7.30e-02 -1.394 0.346 -1.12e+00 6.11e-01 0.05 0.050
7 9 -5.70e-02 -1.210 0.588 -6.35e-01 1.04e+00 0.05 0.050
8 11 6.20e-02 0.953 0.332 -8.73e-01 1.81e+00 0.05 0.050
9 10 -3.40e-02 -0.698 0.642 -7.74e-01 1.57e+00 0.05 0.050
10 17 -3.20e-02 -0.554 0.783 -1.62e+00 Inf 0.05 0.000
11 18 -3.10e-02 -0.534 0.217 -3.16e+00 6.94e-01 0.05 0.050
12 13 -1.90e-02 -0.350 0.822 -7.03e-01 4.81e+00 0.05 0.050
13 19 -2.60e-02 -0.334 0.642 -5.06e+00 Inf 0.05 0.000
14 15 -1.90e-02 -0.308 0.823 -3.35e+00 Inf 0.05 0.000
15 14 -5.60e-02 -0.389 0.630 -3.52e+00 6.54e+00 0.05 0.050
16 4 -1.10e-02 -0.252 0.370 -Inf Inf 0.00 0.000
17 5 -6.95e+13 -0.682 0.406 -Inf Inf 0.00 0.000
18 7 1.10e-02 0.237 0.132 -9.61e-01 Inf 0.05 0.000
19 20 1.30e-02 0.145 0.155 -1.18e+00 Inf 0.05 0.000
20 16 1.84e+13 0.733 0.426 -1.64e+15 2.26e+15 0.05 0.050
Estimated stopping point from ForwardStop rule = 1
# Compute p-values and confidence intervals after AIC stopping - for selected
out.aic <- fsInf(fsfit, type = "aic", sigma = sigmahat)
out.aic
Call:
fsInf(obj = fsfit, sigma = sigmahat, type = "aic")
Standard deviation of noise (specified or estimated) sigma = 0.304
Testing results at step = 5, with alpha = 0.100
Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
2 0.052 3.62 0.060 -0.004 0.161 0.05 0.05
3 0.017 1.58 0.162 -0.124 0.907 0.05 0.05
6 0.078 1.74 0.676 -2.397 0.949 0.05 0.05
1 -0.009 -1.75 0.120 -0.309 0.026 0.05 0.05
8 0.098 1.43 0.806 -Inf 2.127 0.00 0.05
Estimated stopping point from AIC rule = 5
# Compute p-values and confidence intervals after 5 fixed steps
out.fix <- fsInf(fsfit, type = "all", k = 5, sigma = sigmahat)
out.fix
Call:
fsInf(obj = fsfit, sigma = sigmahat, k = 5, type = "all")
Standard deviation of noise (specified or estimated) sigma = 0.304
Testing results at step = 5, with alpha = 0.100
Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
2 0.052 3.62 0.073 -0.008 0.077 0.049 0.05
3 0.017 1.58 0.474 -0.181 0.181 0.050 0.05
6 0.078 1.74 0.854 -2.398 0.183 0.050 0.05
1 -0.009 -1.75 0.120 -0.309 0.026 0.050 0.05
8 0.098 1.43 0.308 -1.265 2.984 0.050 0.05
LASSO - all
Fit a LASSO logistic regression using all available variables.
# Set seed
set.seed(141845)
# Define values
X <- X_all
y <- y_all
# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
x = X,
y = y,
family = "binomial",
alpha = 1,
nfolds = 10
)
# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se
# Run glmnet
gfit <- glmnet(
x = X,
y = y,
family = "binomial",
alpha = 1,
intercept = TRUE,
standardize = TRUE
)
# Extract coef for a given lambda - avoid the intercept term
# (given the small number of observations hence large SD, use min lambda)
lambda <- best_lambda_min
n <- length(y)
beta <- coef(
x = X,
y = y,
object = gfit,
s = lambda,
exact = TRUE
)
# Estimate sigma
gsigmahat <- estimateSigma(
x = X,
y = y,
intercept = TRUE,
standardize = TRUE
)$sigmahat
# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
x = X,
y = y,
beta = beta,
lambda = lambda,
# Level of significance
alpha = 0.1,
family = "binomial",
intercept = TRUE,
sigma = gsigmahat
)
# Prettify
tibble(
names = names(out$vars),
odds_ratio = exp(out$coef0),
pvalue = out$pv,
ci_lo = exp(out$ci[, 1]),
ci_hi = exp(out$ci[, 2])
) %>%
arrange(pvalue) %>%
sable()
| names | odds_ratio | pvalue | ci_lo | ci_hi |
|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.596 | 0.015 | 1.137 | 2.05e+00 |
| age_at_first_treatment_yrs | 0.914 | 0.259 | 0.577 | 1.25e+00 |
| is_eloquent_locationTRUE | 2.679 | 0.341 | 0.024 | 4.43e+02 |
| spetzler_martin_grade | 1.036 | 0.367 | 0.004 | 3.28e+04 |
| locationHemispheric | 0.517 | 0.455 | 0.069 | 2.84e+01 |
| is_diffuse_nidusTRUE | 1.622 | 0.487 | 0.053 | 9.97e+00 |
| max_size_cm | 1.104 | 0.556 | 0.279 | 2.38e+00 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.816 | 0.784 | 0.004 | 8.35e+11 |
LASSO - without scores
Fit a LASSO logistic regression excluding the composite grading scores.
# Set seed
set.seed(141845)
# Define values
X <- X_without_gradings
y <- y_without_gradings
# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
x = X,
y = y,
family = "binomial",
alpha = 1,
nfolds = 10
)
# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se
# Run glmnet
gfit <- glmnet(
x = X,
y = y,
family = "binomial",
alpha = 1,
intercept = TRUE,
standardize = TRUE
)
# Extract coef for a given lambda - avoid the intercept term
# (given the small number of observations hence large SD, use min lambda)
lambda <- best_lambda_min
n <- length(y)
beta <- coef(
x = X,
y = y,
object = gfit,
s = lambda,
exact = TRUE
)
# Estimate sigma
gsigmahat <- estimateSigma(
x = X,
y = y,
intercept = TRUE,
standardize = TRUE
)$sigmahat
# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
x = X,
y = y,
beta = beta,
lambda = lambda,
# Level of significance
alpha = 0.1,
family = "binomial",
intercept = TRUE,
sigma = gsigmahat
)
# Prettify
tibble(
names = names(out$vars),
odds_ratio = exp(out$coef0),
pvalue = out$pv,
ci_lo = exp(out$ci[, 1]),
ci_hi = exp(out$ci[, 2])
) %>%
arrange(pvalue) %>%
sable()
| names | odds_ratio | pvalue | ci_lo | ci_hi |
|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.614 | 0.002 | 1.261 | 2.03 |
| is_eloquent_locationTRUE | 3.464 | 0.079 | 0.782 | 8.32 |
| age_at_first_treatment_yrs | 0.912 | 0.317 | 0.837 | 1.19 |
| max_size_cm | 1.110 | 0.366 | 0.746 | 1.39 |
| is_diffuse_nidusTRUE | 1.895 | 0.383 | 0.138 | 7.90 |
LASSO - without components
Fit a LASSO logistic regression using the composite grading scores rather than their components.
# Set seed
set.seed(141845)
# Define values
X <- X_with_gradings
y <- y_with_gradings
# Perform cross-validation
# (alpha = 1 for lasso, alpha = 0 for ridge, 0 < alpha < 1 for elastic net)
cv_fit <- cv.glmnet(
x = X,
y = y,
family = "binomial",
alpha = 1,
nfolds = 10
)
# Extract the best lambda
best_lambda_min <- cv_fit$lambda.min
best_lambda_1se <- cv_fit$lambda.1se
# Run glmnet
gfit <- glmnet(
x = X,
y = y,
family = "binomial",
alpha = 1,
intercept = TRUE,
standardize = TRUE
)
# Extract coef for a given lambda - avoid the intercept term
# (given the small number of observations hence large SD, use min lambda)
lambda <- best_lambda_min
n <- length(y)
beta <- coef(
x = X,
y = y,
object = gfit,
s = lambda,
exact = TRUE
)
# Estimate sigma
gsigmahat <- estimateSigma(
x = X,
y = y,
intercept = TRUE,
standardize = TRUE
)$sigmahat
# Compute fixed lambda p-values and selection intervals
out <- fixedLassoInf(
x = X,
y = y,
beta = beta,
lambda = lambda,
# Level of significance
alpha = 0.1,
family = "binomial",
intercept = TRUE,
sigma = gsigmahat
)
# Prettify
tibble(
names = names(out$vars),
odds_ratio = exp(out$coef0),
pvalue = out$pv,
ci_lo = exp(out$ci[, 1]),
ci_hi = exp(out$ci[, 2])
) %>%
arrange(pvalue) %>%
sable()
| names | odds_ratio | pvalue | ci_lo | ci_hi |
|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.644 | 0.018 | 1.136 | 2.64e+00 |
| spetzler_martin_grade | 1.622 | 0.048 | 1.007 | 1.21e+01 |
| age_at_first_treatment_yrs | 0.904 | 0.294 | 0.828 | 1.19e+00 |
| is_diffuse_nidusTRUE | 1.718 | 0.444 | 0.091 | 8.19e+00 |
| locationHemispheric | 0.676 | 0.457 | 0.000 | 1.19e+06 |
| locationDeep | 1.173 | 0.668 | 0.000 | 4.34e+04 |
Multivariable - Without scores
Setup
Feature selection
Create predictors.
# Define p-value threshold
pval_threshold <- 0.2
# Get the predictors of interest
predictors <-
univariable_adjusted_recoded %>%
bind_rows(univariable_unadjusted_recoded) %>%
filter(`P-values` < pval_threshold) %>%
pull(`Predictors`)
# Clean categorical variables
predictors <-
predictors %>%
str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
unique()
# Add adjustment variables
predictors <- unique(c(predictors))
# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_without_gradings]
# Print
print(predictors)
[1] "modified_rankin_score_pretreatment" "is_eloquent_location"
[3] "is_diffuse_nidus" "max_size_cm"
[5] "has_deficit" "has_deep_venous_drainage"
[7] "age_at_first_treatment_yrs"
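The regex in the cleaning step above strips the factor-level suffix that `broom::tidy` appends to categorical terms (e.g. `TRUE`, `Hemispheric`) while leaving continuous-variable names untouched; a quick illustration using term names from the tables above:

```r
library(stringr)

pattern <- "^([a-z0-9_]*)([A-Z].*)$"
# Categorical term: the capitalized level suffix is dropped
str_replace("is_eloquent_locationTRUE", pattern, "\\1") # "is_eloquent_location"
# Continuous term: no uppercase suffix, so the name passes through unchanged
str_replace("max_size_cm", pattern, "\\1") # "max_size_cm"
```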
Model selection
Fit logistic.
# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
paste(predictors, collapse = " + ")
))
fit <- glm(
model,
data = df,
family = binomial()
)
# Save
fit_without_grading <- fit
Print results.
# Summarize model coefficients
fit_summary <-
fit %>%
broom::tidy(exponentiate = T, conf.int = T) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
stylized %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.69 | 0.16 | 3.375 | 0.001 | 1.25 | 2.31 |
| is_eloquent_locationTRUE | 4.33 | 0.71 | 2.076 | 0.038 | 1.22 | 20.92 |
| age_at_first_treatment_yrs | 0.90 | 0.06 | -1.667 | 0.096 | 0.80 | 1.02 |
| max_size_cm | 1.11 | 0.10 | 1.069 | 0.285 | 0.92 | 1.35 |
| is_diffuse_nidusTRUE | 1.85 | 0.64 | 0.962 | 0.336 | 0.51 | 6.43 |
| has_deep_venous_drainageTRUE | 1.21 | 0.52 | 0.369 | 0.712 | 0.44 | 3.44 |
| has_deficitTRUE | 0.95 | 0.49 | -0.094 | 0.925 | 0.36 | 2.51 |
| (Intercept) | 0.02 | 1.05 | -3.603 | 0.000 | 0.00 | 0.15 |
Inference
Create robust CIs using the sandwich estimator.
# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
Stylize results.
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
multivariable_pvalue <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
multivariable_pvalue %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.69 | 0.15 | 3.531 | 0.000 | 1.26 | 2.26 |
| is_eloquent_locationTRUE | 4.33 | 0.61 | 2.407 | 0.016 | 1.31 | 14.26 |
| age_at_first_treatment_yrs | 0.90 | 0.06 | -1.559 | 0.119 | 0.80 | 1.03 |
| max_size_cm | 1.11 | 0.10 | 1.070 | 0.285 | 0.92 | 1.34 |
| is_diffuse_nidusTRUE | 1.85 | 0.62 | 0.999 | 0.318 | 0.55 | 6.17 |
| has_deep_venous_drainageTRUE | 1.21 | 0.51 | 0.378 | 0.705 | 0.45 | 3.26 |
| has_deficitTRUE | 0.95 | 0.48 | -0.097 | 0.923 | 0.38 | 2.42 |
| (Intercept) | 0.02 | 1.25 | -3.033 | 0.002 | 0.00 | 0.26 |
Plot the coefficients with their confidence intervals.
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
filter(term != "(Intercept)") %>%
arrange(desc(estimate)) %>%
mutate(term = factor(term, levels = term)) %>%
ggplot(aes(x = estimate, y = term, color = term)) +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
labs(x = "Odds ratio", y = NULL) +
guides(color = "none") +
theme(
axis.text = element_text(size = 12),
axis.title = element_text(size = 13)
)
Predict
Predict.
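The prediction chunk is not shown; `predicted_probs` used below is presumably the in-sample fitted probabilities, i.e. something like the following sketch (a `mtcars` stand-in model makes it runnable; in the notebook, `fit` is the logistic model fitted above):

```r
# Stand-in model so the sketch runs on its own; in the notebook, `fit` is the
# logistic model fitted above
fit <- glm(vs ~ mpg + wt, data = mtcars, family = binomial())
# In-sample predicted probabilities on the response (probability) scale
predicted_probs <- predict(fit, type = "response")
range(predicted_probs) # all values lie in [0, 1]
```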
Plot histogram of predictions.
tibble(
Predictions = predicted_probs,
Truth = fit$model[, CHOSEN_OUTCOME]
) %>%
pivot_longer(
cols = everything(),
names_to = "keys",
values_to = "values"
) %>%
mutate(keys = fct_inorder(keys)) %>%
ggplot(aes(x = values, color = keys, fill = keys)) +
geom_histogram(alpha = 0.7) +
geom_vline(xintercept = 0.5, linetype = 2, alpha = 0.5) +
facet_wrap(vars(keys), ncol = 1, scales = "fixed") +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(size = 12),
strip.text.x = element_text(size = 13),
legend.position = "none"
)
Evaluate
How many examples were used?
# All examples
num_all_examples <- nrow(fit$data)
num_all_examples_positive <- sum(fit$data[, CHOSEN_OUTCOME], na.rm = T)
num_complete_cases <- nrow(fit$model)
num_complete_cases_positive <- sum(fit$model[, CHOSEN_OUTCOME], na.rm = T)
# Print
sprintf(
"
All examples = %s (%s with outcome)
Fitted examples = %s (%s with outcome)",
num_all_examples,
num_all_examples_positive,
num_complete_cases,
num_complete_cases_positive
) %>% cat()
All examples = 232 (27 with outcome)
Fitted examples = 224 (26 with outcome)
Residual analysis.
Pseudo R-squared.
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML r2CU
-63.857 -80.421 33.129 0.206 0.137 0.268
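As a quick check, McFadden's pseudo-R² and G² can be reproduced directly from the log-likelihoods reported above:

```r
llh <- -63.857      # log-likelihood of the fitted model
llh_null <- -80.421 # log-likelihood of the intercept-only model
round(1 - llh / llh_null, 3)    # McFadden's R^2: 0.206
round(-2 * (llh_null - llh), 2) # G^2 (likelihood-ratio statistic): 33.13
```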
ROC-AUC.
# Compute
auroc <- cvAUC::ci.cvAUC(
predictions = predicted_probs,
labels = fit$model[, CHOSEN_OUTCOME]
)
# Print
sprintf(
"Cross-validated AUROC = %.3f (SE, %.3f; %s%% CI, %.3f-%.3f)",
auroc$cvAUC,
auroc$se,
auroc$confidence * 100,
auroc$ci[1],
auroc$ci[2]
)
[1] "Cross-validated AUROC = 0.801 (SE, 0.041; 95% CI, 0.721-0.882)"
PR-AUC.
# Construct object
precrec_obj_tr <- precrec::evalmod(
scores = predicted_probs,
labels = fit$model[, CHOSEN_OUTCOME],
calc_avg = F,
cb_alpha = 0.05
)
# Print
# NOTE: Can use precrec::auc_ci to get CIs, but need to have multiple datasets
precrec_obj_tr %>%
precrec::part() %>%
precrec::pauc() %>%
sable()
| modnames | dsids | curvetypes | paucs | spaucs |
|---|---|---|---|---|
| m1 | 1 | ROC | 0.801 | 0.801 |
| m1 | 1 | PRC | 0.449 | 0.449 |
ROC and PR curves.
Another version of the ROC curve.
# Calculate ROC curve
roc_curve <- pROC::roc(fit$model[, CHOSEN_OUTCOME] ~ predicted_probs)
# Plot ROC curve
plot(roc_curve)
Area under the curve: 0.801
Plot all performance measures.
# Construct object
precrec_obj_tr2 <- precrec::evalmod(
scores = predicted_probs,
labels = fit$model[, CHOSEN_OUTCOME],
mode = "basic"
)
# Plot
precrec_obj_tr2 %>% autoplot()
Confusion matrix.
Kappa statistic: 0 = No agreement; 0.1-0.2 = Slight agreement; 0.2-0.4 = Fair agreement; 0.4-0.6 = Moderate agreement; 0.6-0.8 = Substantial agreement; 0.8-1.0 = Near-perfect agreement.
No Information Rate (NIR): the proportion of the largest observed class. For example, if 23 of 35 patients had the outcome, that is the largest class and the NIR is 23/35 = 0.657. The larger the deviation of Accuracy from the NIR, the more confident we are that the model is not simply choosing the largest class.
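A minimal version of the worked example above:

```r
# Hypothetical cohort: 23 of 35 patients have the outcome, so the outcome
# class is the largest class
labels <- c(rep("Event", 23), rep("No Event", 12))
nir <- max(prop.table(table(labels)))
round(nir, 3) # 0.657
```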
# Scores
pred <- factor(ifelse(predicted_probs > 0.5, "Event", "No Event"))
truth <- factor(ifelse(fit$model[, CHOSEN_OUTCOME], "Event", "No Event"))
# Count values per category
xtab <- table(
scores = pred,
labels = truth
)
# Compute
confusion_matrix <-
caret::confusionMatrix(
xtab,
positive = "Event",
prevalence = NULL, # Provide prevalence as a proportion here if you want
mode = "sens_spec"
)
# Print
confusion_matrix
Confusion Matrix and Statistics
labels
scores Event No Event
Event 6 2
No Event 20 196
Accuracy : 0.902
95% CI : (0.855, 0.937)
No Information Rate : 0.884
P-Value [Acc > NIR] : 0.23653
Kappa : 0.316
Mcnemar's Test P-Value : 0.00029
Sensitivity : 0.2308
Specificity : 0.9899
Pos Pred Value : 0.7500
Neg Pred Value : 0.9074
Prevalence : 0.1161
Detection Rate : 0.0268
Detection Prevalence : 0.0357
Balanced Accuracy : 0.6103
'Positive' Class : Event
Variable importance
Plot importance (based on magnitude of z-score).
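For a `glm`, `caret::varImp` scores each non-intercept term by the absolute value of its z-statistic, so the plot below is a ranking of |coefficient / SE|. A quick check on a toy model (`mtcars` stand-in; in the notebook the model is `fit` above):

```r
# Toy logistic fit; caret::varImp(toy)$Overall should match these values
toy <- glm(vs ~ mpg + wt, data = mtcars, family = binomial())
zvals <- coef(summary(toy))[, "z value"]
abs(zvals[names(zvals) != "(Intercept)"])
```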
# Calculate importance
varimp <-
fit %>%
caret::varImp() %>%
as_tibble(rownames = "rn") %>%
mutate(Predictor = factor(rn) %>% fct_reorder(Overall)) %>%
dplyr::select(Predictor, Importance = Overall)
# Plot variable importance
varimp %>%
ggplot(aes(Predictor, Importance, fill = Importance)) +
geom_bar(stat = "identity", alpha = 0.9, width = 0.65) +
coord_flip() +
guides(fill = "none") +
labs(y = "\nVariable Importance", x = NULL) +
scale_fill_viridis_c() +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.border = element_blank(),
axis.text.x = element_text(margin = margin(t = 7), size = 12),
axis.text.y = element_text(margin = margin(r = -10), size = 12),
axis.title = element_text(size = 13)
)
Multivariable - Without components
Feature selection
Create predictors.
# Define p-value threshold
pval_threshold <- 0.2
# Get the predictors of interest
predictors <-
univariable_unadjusted_recoded %>%
bind_rows(univariable_adjusted) %>%
filter(`P-values` < pval_threshold) %>%
pull(`Predictors`)
# Clean categorical variables
predictors <-
predictors %>%
str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
unique()
# Add adjustment variables
predictors <- unique(c(predictors))
# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]
# Print
print(predictors)
[1] "modified_rankin_score_pretreatment" "is_diffuse_nidus"
[3] "spetzler_martin_grade" "age_at_first_treatment_yrs"
[5] "has_deficit" "location"
Model selection
Fit logistic.
# Define data
df <- df_multi
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
paste(predictors, collapse = " + ")
))
fit <- glm(
model,
data = df,
family = binomial()
)
fit_with_grading <- fit
Print results.
# Summarize model coefficients
fit_summary <-
fit %>%
broom::tidy(exponentiate = T, conf.int = T) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
stylized %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.71 | 0.16 | 3.347 | 0.001 | 1.25 | 2.36 |
| spetzler_martin_grade | 1.70 | 0.25 | 2.109 | 0.035 | 1.05 | 2.83 |
| age_at_first_treatment_yrs | 0.89 | 0.06 | -1.800 | 0.072 | 0.79 | 1.01 |
| is_diffuse_nidusTRUE | 1.73 | 0.60 | 0.911 | 0.362 | 0.52 | 5.63 |
| locationHemispheric | 0.61 | 0.73 | -0.671 | 0.502 | 0.15 | 2.87 |
| locationDeep | 1.14 | 0.79 | 0.162 | 0.872 | 0.26 | 5.93 |
| has_deficitTRUE | 1.06 | 0.52 | 0.104 | 0.917 | 0.38 | 2.91 |
| (Intercept) | 0.03 | 1.23 | -2.938 | 0.003 | 0.00 | 0.26 |
Inference
Create robust CIs using the sandwich estimator.
# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
Stylize results.
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
multivariable_pvalue <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
multivariable_pvalue %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.71 | 0.15 | 3.583 | 0.000 | 1.27 | 2.29 |
| spetzler_martin_grade | 1.70 | 0.28 | 1.874 | 0.061 | 0.98 | 2.96 |
| age_at_first_treatment_yrs | 0.89 | 0.06 | -1.738 | 0.082 | 0.79 | 1.01 |
| is_diffuse_nidusTRUE | 1.73 | 0.58 | 0.946 | 0.344 | 0.56 | 5.39 |
| locationHemispheric | 0.61 | 0.69 | -0.709 | 0.479 | 0.16 | 2.38 |
| locationDeep | 1.14 | 0.83 | 0.153 | 0.878 | 0.22 | 5.80 |
| has_deficitTRUE | 1.06 | 0.54 | 0.100 | 0.921 | 0.37 | 3.03 |
| (Intercept) | 0.03 | 1.45 | -2.490 | 0.013 | 0.00 | 0.46 |
Plot the coefficients with their confidence intervals.
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
filter(term != "(Intercept)") %>%
arrange(desc(estimate)) %>%
mutate(term = factor(term, levels = term)) %>%
ggplot(aes(x = estimate, y = term, color = term)) +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
labs(x = "Odds ratio", y = NULL) +
guides(color = "none") +
theme(
axis.text = element_text(size = 12),
axis.title = element_text(size = 13)
)
Predict
Predict.
Plot histogram of predictions.
tibble(
Predictions = predicted_probs,
Truth = fit$model[, CHOSEN_OUTCOME]
) %>%
pivot_longer(
cols = everything(),
names_to = "keys",
values_to = "values"
) %>%
mutate(keys = fct_inorder(keys)) %>%
ggplot(aes(x = values, color = keys, fill = keys)) +
geom_histogram(alpha = 0.7) +
geom_vline(xintercept = 0.5, linetype = 2, alpha = 0.5) +
facet_wrap(vars(keys), ncol = 1, scales = "fixed") +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text = element_text(size = 12),
strip.text.x = element_text(size = 13),
legend.position = "none"
)
Evaluate
How many examples were used?
# All examples
num_all_examples <- nrow(fit$data)
num_all_examples_positive <- sum(fit$data[, CHOSEN_OUTCOME], na.rm = T)
num_complete_cases <- nrow(fit$model)
num_complete_cases_positive <- sum(fit$model[, CHOSEN_OUTCOME], na.rm = T)
# Print
sprintf(
"
All examples = %s (%s with outcome)
Fitted examples = %s (%s with outcome)",
num_all_examples,
num_all_examples_positive,
num_complete_cases,
num_complete_cases_positive
) %>% cat()
All examples = 236 (27 with outcome)
Fitted examples = 224 (26 with outcome)
Residual analysis.
Pseudo R-squared.
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML r2CU
-63.563 -80.421 33.717 0.210 0.140 0.273
ROC-AUC.
# Compute
auroc <- cvAUC::ci.cvAUC(
predictions = predicted_probs,
labels = fit$model[, CHOSEN_OUTCOME]
)
# Print
sprintf(
"Cross-validated AUROC = %.3f (SE, %.3f; %s%% CI, %.3f-%.3f)",
auroc$cvAUC,
auroc$se,
auroc$confidence * 100,
auroc$ci[1],
auroc$ci[2]
)
[1] "Cross-validated AUROC = 0.807 (SE, 0.043; 95% CI, 0.722-0.891)"
PR-AUC.
# Construct object
precrec_obj_tr <- precrec::evalmod(
scores = predicted_probs,
labels = fit$model[, CHOSEN_OUTCOME],
calc_avg = F,
cb_alpha = 0.05
)
# Print
# NOTE: Can use precrec::auc_ci to get CIs, but need to have multiple datasets
precrec_obj_tr %>%
precrec::part() %>%
precrec::pauc() %>%
sable()
| modnames | dsids | curvetypes | paucs | spaucs |
|---|---|---|---|---|
| m1 | 1 | ROC | 0.807 | 0.807 |
| m1 | 1 | PRC | 0.472 | 0.472 |
ROC and PR curves.
Another version of the ROC curve.
# Calculate ROC curve
roc_curve <- pROC::roc(fit$model[, CHOSEN_OUTCOME] ~ predicted_probs)
# Plot ROC curve
plot(roc_curve)
Area under the curve: 0.807
Plot all performance measures.
# Construct object
precrec_obj_tr2 <- precrec::evalmod(
scores = predicted_probs,
labels = fit$model[, CHOSEN_OUTCOME],
mode = "basic"
)
# Plot
precrec_obj_tr2 %>% autoplot()
Confusion matrix.
Kappa statistic: 0 = No agreement; 0.1-0.2 = Slight agreement; 0.2-0.4 = Fair agreement; 0.4-0.6 = Moderate agreement; 0.6-0.8 = Substantial agreement; 0.8-1.0 = Near-perfect agreement.
No Information Rate (NIR): the proportion of the largest observed class. For example, if 23 of 35 patients had the outcome, that is the largest class and the NIR is 23/35 = 0.657. The larger the deviation of Accuracy from the NIR, the more confident we are that the model is not simply choosing the largest class.
# Scores
pred <- factor(ifelse(predicted_probs > 0.5, "Event", "No Event"))
truth <- factor(ifelse(fit$model[, CHOSEN_OUTCOME], "Event", "No Event"))
# Count values per category
xtab <- table(
scores = pred,
labels = truth
)
# Compute
confusion_matrix <-
caret::confusionMatrix(
xtab,
positive = "Event",
prevalence = NULL, # Provide prevalence as a proportion here if you want
mode = "sens_spec"
)
# Print
confusion_matrix
Confusion Matrix and Statistics
labels
scores Event No Event
Event 6 1
No Event 20 197
Accuracy : 0.906
95% CI : (0.86, 0.941)
No Information Rate : 0.884
P-Value [Acc > NIR] : 0.175
Kappa : 0.331
Mcnemar's Test P-Value : 8.57e-05
Sensitivity : 0.2308
Specificity : 0.9949
Pos Pred Value : 0.8571
Neg Pred Value : 0.9078
Prevalence : 0.1161
Detection Rate : 0.0268
Detection Prevalence : 0.0312
Balanced Accuracy : 0.6129
'Positive' Class : Event
Variable importance
Plot importance (based on magnitude of z-score).
# Calculate importance
varimp <-
fit %>%
caret::varImp() %>%
as_tibble(rownames = "rn") %>%
mutate(Predictor = factor(rn) %>% fct_reorder(Overall)) %>%
dplyr::select(Predictor, Importance = Overall)
# Plot variable importance
varimp %>%
ggplot(aes(Predictor, Importance, fill = Importance)) +
geom_bar(stat = "identity", alpha = 0.9, width = 0.65) +
coord_flip() +
guides(fill = "none") +
labs(y = "\nVariable Importance", x = NULL) +
scale_fill_viridis_c() +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.border = element_blank(),
axis.text.x = element_text(margin = margin(t = 7), size = 12),
axis.text.y = element_text(margin = margin(r = -10), size = 12),
axis.title = element_text(size = 13)
)
Multivariable - With interactions
Presents with hemorrhage
Fit logistic.
# Define data
df <- df_multi
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
"modified_rankin_score_pretreatment*has_hemorrhage +
spetzler_martin_grade*has_hemorrhage +
age_at_first_treatment_yrs*has_hemorrhage"
))
fit <- glm(
model,
data = df,
family = binomial()
)
Print results.
# Summarize model coefficients
fit_summary <-
fit %>%
broom::tidy(exponentiate = T, conf.int = T) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
stylized %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 2.39 | 0.39 | 2.250 | 0.024 | 1.14 | 5.36 |
| spetzler_martin_grade | 1.88 | 0.35 | 1.803 | 0.071 | 1.00 | 4.05 |
| has_hemorrhageTRUE:age_at_first_treatment_yrs | 0.84 | 0.13 | -1.373 | 0.170 | 0.65 | 1.07 |
| modified_rankin_score_pretreatment:has_hemorrhageTRUE | 0.76 | 0.44 | -0.637 | 0.524 | 0.31 | 1.76 |
| has_hemorrhageTRUE | 3.94 | 2.36 | 0.582 | 0.561 | 0.05 | 571.97 |
| age_at_first_treatment_yrs | 0.98 | 0.08 | -0.267 | 0.789 | 0.83 | 1.17 |
| has_hemorrhageTRUE:spetzler_martin_grade | 1.10 | 0.45 | 0.214 | 0.830 | 0.44 | 2.62 |
| (Intercept) | 0.01 | 1.91 | -2.682 | 0.007 | 0.00 | 0.16 |
Model comparison
Special cases
SM < 4
Select features.
# Define p-value threshold
pval_threshold <- 0.2
# Get the predictors of interest
predictors <-
univariable_adjusted_recoded %>%
bind_rows(univariable_unadjusted_recoded) %>%
filter(`P-values` < pval_threshold) %>%
pull(`Predictors`)
# Clean categorical variables
predictors <-
predictors %>%
str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
unique()
# Add adjustment variables
predictors <- unique(c(predictors))
# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]
# Replace SM by its binary variant
predictors <- c(predictors, "is_spetzler_martin_grade_less_than_4")
predictors <- predictors[!predictors == "spetzler_martin_grade"]
# Print
print(predictors)
[1] "modified_rankin_score_pretreatment"
[2] "is_diffuse_nidus"
[3] "has_deficit"
[4] "location"
[5] "age_at_first_treatment_yrs"
[6] "is_spetzler_martin_grade_less_than_4"
Fit logistic.
# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
paste(predictors, collapse = " + ")
))
fit <- glm(
model,
data = df,
family = binomial()
)
# Save
fit_without_grading <- fit
Print results.
# Summarize model coefficients
fit_summary <-
fit %>%
broom::tidy(exponentiate = T, conf.int = T) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
stylized %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.68 | 0.16 | 3.293 | 0.001 | 1.24 | 2.30 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.39 | 0.55 | -1.690 | 0.091 | 0.13 | 1.18 |
| age_at_first_treatment_yrs | 0.91 | 0.06 | -1.505 | 0.132 | 0.81 | 1.03 |
| is_diffuse_nidusTRUE | 2.15 | 0.58 | 1.327 | 0.185 | 0.68 | 6.72 |
| locationHemispheric | 0.63 | 0.72 | -0.642 | 0.521 | 0.16 | 2.90 |
| locationDeep | 1.44 | 0.78 | 0.470 | 0.639 | 0.33 | 7.35 |
| has_deficitTRUE | 1.08 | 0.51 | 0.145 | 0.885 | 0.39 | 2.95 |
| (Intercept) | 0.18 | 1.00 | -1.684 | 0.092 | 0.02 | 1.24 |
Create robust CIs using the sandwich estimator.
# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
Stylize results.
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
multivariable_pvalue <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
multivariable_pvalue %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.68 | 0.14 | 3.610 | 0.000 | 1.27 | 2.22 |
| is_spetzler_martin_grade_less_than_4TRUE | 0.39 | 0.60 | -1.569 | 0.117 | 0.12 | 1.26 |
| age_at_first_treatment_yrs | 0.91 | 0.07 | -1.407 | 0.159 | 0.80 | 1.04 |
| is_diffuse_nidusTRUE | 2.15 | 0.56 | 1.363 | 0.173 | 0.71 | 6.49 |
| locationHemispheric | 0.63 | 0.69 | -0.674 | 0.500 | 0.16 | 2.43 |
| locationDeep | 1.44 | 0.83 | 0.439 | 0.661 | 0.28 | 7.33 |
| has_deficitTRUE | 1.08 | 0.55 | 0.136 | 0.892 | 0.37 | 3.15 |
| (Intercept) | 0.18 | 1.04 | -1.628 | 0.104 | 0.02 | 1.41 |
Plot the coefficients with their confidence intervals.
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
filter(term != "(Intercept)") %>%
arrange(desc(estimate)) %>%
mutate(term = factor(term, levels = term)) %>%
ggplot(aes(x = estimate, y = term, color = term)) +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
labs(x = "Odds ratio", y = NULL) +
guides(color = "none") +
theme(
axis.text = element_text(size = 12),
axis.title = element_text(size = 13)
)
Had surgery
Select features.
# Define p-value threshold
pval_threshold <- 0.2
# Get the predictors of interest
predictors <-
univariable_adjusted_recoded %>%
bind_rows(univariable_unadjusted_recoded) %>%
filter(`P-values` < pval_threshold) %>%
pull(`Predictors`)
# Clean categorical variables
predictors <-
predictors %>%
str_replace("^([a-z0-9_]*)([A-Z].*)$", "\\1") %>%
unique()
# Add adjustment variables
predictors <- unique(c(predictors))
# Remove unwanted predictors
predictors <- predictors[!predictors %in% unwanted_with_gradings]
# Add the surgery indicator and drop the procedure-combinations variable
predictors <- c(predictors, "is_surgery")
predictors <- predictors[!predictors == "procedure_combinations"]
# Print
print(predictors)
[1] "modified_rankin_score_pretreatment" "spetzler_martin_grade"
[3] "is_diffuse_nidus" "has_deficit"
[5] "location" "age_at_first_treatment_yrs"
[7] "is_surgery"
Fit logistic.
# Define data
df <- df_multi %>% drop_na(modified_rankin_score_pretreatment, location)
# Create formula
model <- as.formula(paste(
CHOSEN_OUTCOME,
"~",
paste(predictors, collapse = " + ")
))
fit <- glm(
model,
data = df,
family = binomial()
)
# Save
fit_without_grading <- fit
Print results.
# Summarize model coefficients
fit_summary <-
fit %>%
broom::tidy(exponentiate = T, conf.int = T) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
stylized <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
stylized %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.68 | 0.16 | 3.195 | 0.001 | 1.23 | 2.33 |
| spetzler_martin_grade | 1.84 | 0.28 | 2.213 | 0.027 | 1.08 | 3.22 |
| age_at_first_treatment_yrs | 0.90 | 0.06 | -1.773 | 0.076 | 0.79 | 1.01 |
| is_diffuse_nidusTRUE | 1.67 | 0.60 | 0.846 | 0.398 | 0.50 | 5.44 |
| is_surgeryTRUE | 1.69 | 0.71 | 0.737 | 0.461 | 0.40 | 6.76 |
| locationHemispheric | 0.60 | 0.73 | -0.694 | 0.487 | 0.15 | 2.83 |
| locationDeep | 1.18 | 0.79 | 0.205 | 0.837 | 0.27 | 6.22 |
| has_deficitTRUE | 1.08 | 0.52 | 0.150 | 0.881 | 0.39 | 3.02 |
| (Intercept) | 0.02 | 1.34 | -2.963 | 0.003 | 0.00 | 0.22 |
Create robust CIs using the sandwich estimator.
# Calculate robust standard errors
robust_se <- sandwich::vcovHC(fit, type = "HC0")
# Compute robust confidence intervals
robust_ci <- lmtest::coeftest(fit, vcov. = robust_se)
Stylize results.
# Summarize model coefficients
fit_summary <-
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
arrange(term == "(Intercept)", p.value) %>%
rename(odds_ratio = estimate, z_value = statistic)
# Stylize and print
multivariable_pvalue <-
fit_summary %>%
rename(
"Predictors" = term,
"Odds Ratios (OR)" = odds_ratio,
"SE" = std.error,
"Z-scores" = z_value,
"P-values" = p.value,
"CI (low)" = conf.low,
"CI (high)" = conf.high,
) %>%
mutate(
`Odds Ratios (OR)` = round(`Odds Ratios (OR)`, 2),
`SE` = round(`SE`, 2),
`CI (low)` = round(`CI (low)`, 2),
`CI (high)` = round(`CI (high)`, 2),
`P-values` = round(`P-values`, 3),
)
# Print
multivariable_pvalue %>% sable()
| Predictors | Odds Ratios (OR) | SE | Z-scores | P-values | CI (low) | CI (high) |
|---|---|---|---|---|---|---|
| modified_rankin_score_pretreatment | 1.68 | 0.15 | 3.522 | 0.000 | 1.26 | 2.23 |
| spetzler_martin_grade | 1.84 | 0.32 | 1.887 | 0.059 | 0.98 | 3.46 |
| age_at_first_treatment_yrs | 0.90 | 0.06 | -1.712 | 0.087 | 0.79 | 1.02 |
| is_diffuse_nidusTRUE | 1.67 | 0.60 | 0.857 | 0.392 | 0.52 | 5.36 |
| locationHemispheric | 0.60 | 0.69 | -0.743 | 0.458 | 0.16 | 2.31 |
| is_surgeryTRUE | 1.69 | 0.76 | 0.685 | 0.493 | 0.38 | 7.54 |
| locationDeep | 1.18 | 0.84 | 0.193 | 0.847 | 0.23 | 6.13 |
| has_deficitTRUE | 1.08 | 0.55 | 0.143 | 0.886 | 0.37 | 3.16 |
| (Intercept) | 0.02 | 1.71 | -2.315 | 0.021 | 0.00 | 0.54 |
Plot the coefficients with their confidence intervals.
robust_ci %>%
# broom does not support exponentiation after coeftest so do it manually
broom::tidy(exponentiate = F, conf.int = T) %>%
mutate(across(c(estimate, "conf.low", "conf.high"), exp)) %>%
filter(term != "(Intercept)") %>%
arrange(desc(estimate)) %>%
mutate(term = factor(term, levels = term)) %>%
ggplot(aes(x = estimate, y = term, color = term)) +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high)) +
geom_vline(xintercept = 1, linetype = 2, alpha = 0.5) +
labs(x = "Odds ratio", y = NULL) +
guides(color = "none") +
theme(
axis.text = element_text(size = 12),
axis.title = element_text(size = 13)
)
Grade 5s
df_uni %>%
drop_na(spetzler_martin_grade) %>%
count(is_poor_outcome, spetzler_martin_grade, name = "num_patients") %>%
mutate(
pct_patients = scales::percent(num_patients / sum(num_patients), 1)
) %>%
sable()
| is_poor_outcome | spetzler_martin_grade | num_patients | pct_patients |
|---|---|---|---|
| FALSE | 1 | 38 | 17% |
| FALSE | 2 | 45 | 20% |
| FALSE | 3 | 65 | 29% |
| FALSE | 4 | 37 | 16% |
| FALSE | 5 | 17 | 7% |
| TRUE | 1 | 2 | 1% |
| TRUE | 2 | 3 | 1% |
| TRUE | 3 | 6 | 3% |
| TRUE | 4 | 7 | 3% |
| TRUE | 5 | 8 | 4% |
Only consider those with SM grade 5.
df_uni %>%
  filter(spetzler_martin_grade == 5) %>%
  count(is_poor_outcome, spetzler_martin_grade, name = "num_patients") %>%
  mutate(
    pct_patients = scales::percent(num_patients / sum(num_patients), 1)
  ) %>%
  sable()
| is_poor_outcome | spetzler_martin_grade | num_patients | pct_patients |
|---|---|---|---|
| FALSE | 5 | 17 | 68% |
| TRUE | 5 | 8 | 32% |
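With only 25 grade-5 patients, the 32% poor-outcome rate is imprecise, and an exact binomial confidence interval (base R, using the 8/25 count from the table above) makes that explicit:

```r
# Exact (Clopper-Pearson) 95% CI for the proportion of poor outcomes
# among the 25 Spetzler-Martin grade 5 patients (8 events)
ci <- binom.test(x = 8, n = 25)$conf.int
round(ci, 3)
```

The resulting interval is wide, so the grade-5 subgroup estimate should be interpreted cautiously.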
Honest estimation (WIP)
- https://www.pnas.org/doi/full/10.1073/pnas.1510489113
- https://arxiv.org/pdf/1510.04342
- https://github.com/grf-labs/grf
- https://grf-labs.github.io/grf/REFERENCE.html
- https://cran.r-project.org/web/packages/hettx/vignettes/detect_idiosyncratic_vignette.html
estimate std.err
0.3370 0.0492
estimate std.err
0.4107 0.0509
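For the honest-estimation backlog item, a sketch of how grf's `causal_forest` (honest sample splitting is its default) could be applied. The `X`/`W`/`Y` below are simulated placeholders rather than columns of the actual dataset, and the treatment and outcome definitions would need to match the study's:

```r
library(grf)

# Simulated stand-in data; replace with the study's design matrix,
# treatment indicator, and poor-outcome indicator
n <- 500
X <- matrix(rnorm(n * 4), n, 4)
W <- rbinom(n, 1, 0.5)
Y <- rbinom(n, 1, plogis(-1 + 0.5 * W + X[, 1]))

# honesty = TRUE (the default) splits each tree's sample into one half
# for placing splits and another half for estimating leaf effects
forest <- causal_forest(X, Y, W, honesty = TRUE, seed = 1891)

# Doubly robust average treatment effect with standard error,
# matching the "estimate / std.err" output shown above
average_treatment_effect(forest)
```

This is only a starting point; the printed estimates above presumably come from a similar call, but the exact specification is still WIP.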
Write
Setup
Create the necessary directories.
# Get today's date in ISO format
today <- format(Sys.Date(), "%Y-%m-%d")
# Create folder names
base_folder <- file.path(DST_DIRNAME, ANALYSIS_NAME)
date_folder <- file.path(base_folder, today)
csv_folder <- file.path(date_folder, "csv")
pdf_folder <- file.path(date_folder, "pdf")
html_folder <- file.path(date_folder, "html")
# Create folders; recursive = TRUE also creates the parent folders,
# and showWarnings = FALSE silences "already exists" warnings
for (folder in c(csv_folder, pdf_folder, html_folder)) {
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
}
# Print folder names
print(base_folder)
print(date_folder)
print(csv_folder)
print(pdf_folder)
print(html_folder)
[1] "../../outputs//predictive-analytics/poor-outcomes"
[1] "../../outputs//predictive-analytics/poor-outcomes/2024-07-29"
[1] "../../outputs//predictive-analytics/poor-outcomes/2024-07-29/csv"
[1] "../../outputs//predictive-analytics/poor-outcomes/2024-07-29/pdf"
[1] "../../outputs//predictive-analytics/poor-outcomes/2024-07-29/html"
Figures
Write all figures.
Tables
Write all tables.
# # Arguments
# df <- tables$desc_stats_cohorts_cv_prolaio
# filepath_csv <- file.path(csv_folder, "desc-stats_by-cohort_cov.csv")
# filepath_html <- file.path(html_folder, "desc-stats_by-cohort_cov.html")
#
# # Save as CSV
# write_csv(
# x = df,
# file = filepath_csv
# )
#
# # Save as HTML
# # - Save pdf does not work with webshot
# # - It works with pagedown but not as pretty as desired
# df %>%
# sable() %>%
# kableExtra::save_kable(file = filepath_html)
Reproducibility
Linting and styling
Dependency management
Documentation
Session Info
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;
LAPACK version 3.11.0
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] grf_2.3.2 table1_1.4.3 lubridate_1.9.3
[4] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[7] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
[10] tibble_3.2.1 tidyverse_2.0.0 selectiveInference_1.2.5
[13] MASS_7.3-60 adaptMCMC_1.5 coda_0.19-4.1
[16] survival_3.5-7 intervals_0.15.4 glmnet_4.1-8
[19] Matrix_1.6-1.1 magrittr_2.0.3 ggplot2_3.5.0
[22] kableExtra_1.4.0 rmdformats_1.0.4 knitr_1.45
loaded via a namespace (and not attached):
[1] rstudioapi_0.15.0 jsonlite_1.8.8 shape_1.4.6.1
[4] magick_2.8.3 farver_2.1.1 rmarkdown_2.25
[7] vctrs_0.6.5 ROCR_1.0-11 base64enc_0.1-3
[10] htmltools_0.5.7 broom_1.0.6 Formula_1.2-5
[13] pROC_1.18.5 caret_6.0-94 sass_0.4.8
[16] parallelly_1.37.1 bslib_0.6.1 desc_1.4.3
[19] cvAUC_1.1.4 plyr_1.8.9 sandwich_3.1-0
[22] zoo_1.8-12 cachem_1.0.8 lifecycle_1.0.4
[25] iterators_1.0.14 pkgconfig_2.0.3 R6_2.5.1
[28] fastmap_1.1.1 future_1.33.2 digest_0.6.34
[31] colorspace_2.1-0 ps_1.7.6 patchwork_1.2.0
[34] labeling_0.4.3 fansi_1.0.6 timechange_0.3.0
[37] compiler_4.3.2 remotes_2.4.2.1 proxy_0.4-27
[40] bit64_4.0.5 withr_3.0.0 pander_0.6.5
[43] backports_1.4.1 ggcorrplot_0.1.4.1 highr_0.10
[46] R.utils_2.12.3 lava_1.8.0 ModelMetrics_1.2.2.2
[49] tools_4.3.2 lmtest_0.9-40 future.apply_1.11.2
[52] lintr_3.1.1 nnet_7.3-19 R.oo_1.26.0
[55] glue_1.7.0 callr_3.7.5 nlme_3.1-163
[58] R.cache_0.16.0 grid_4.3.2 checkmate_2.3.1
[61] reshape2_1.4.4 generics_0.1.3 precrec_0.14.4
[64] recipes_1.0.10 gtable_0.3.4 tzdb_0.4.0
[67] R.methodsS3_1.8.2 class_7.3-22 data.table_1.15.4
[70] hms_1.1.3 xml2_1.3.6 utf8_1.2.4
[73] foreach_1.5.2 pillar_1.9.0 vroom_1.6.5
[76] splines_4.3.2 pryr_0.1.6 lattice_0.21-9
[79] renv_1.0.4 bit_4.0.5 tidyselect_1.2.1
[82] bookdown_0.39 svglite_2.1.3 stats4_4.3.2
[85] xfun_0.42 hardhat_1.4.0 rapportools_1.1
[88] timeDate_4032.109 matrixStats_1.3.0 rex_1.2.1
[91] stringi_1.8.3 lazyeval_0.2.2 yaml_2.3.8
[94] evaluate_0.23 codetools_0.2-19 tcltk_4.3.2
[97] BiocManager_1.30.22 cli_3.6.2 rpart_4.1.21
[100] systemfonts_1.0.5 processx_3.8.3 munsell_0.5.0
[103] jquerylib_0.1.4 styler_1.10.2 xmlparsedata_1.0.5
[106] pscl_1.5.9 Rcpp_1.0.12 globals_0.16.3
[109] summarytools_1.0.1 gower_1.0.1 assertthat_0.2.1
[112] listenv_0.9.1 viridisLite_0.4.2 ipred_0.9-14
[115] cyclocomp_1.1.1 scales_1.3.0 prodlim_2023.08.28
[118] e1071_1.7-14 crayon_1.5.2 rlang_1.1.3
References
[[1]]
Scheidegger A (2024). _adaptMCMC: Implementation of a Generic
Adaptive Monte Carlo Markov Chain Sampler_. R package version 1.5,
<https://CRAN.R-project.org/package=adaptMCMC>.
[[2]]
R Core Team (2023). _R: A Language and Environment for Statistical
Computing_. R Foundation for Statistical Computing, Vienna, Austria.
<https://www.R-project.org/>.
[[3]]
Plummer M, Best N, Cowles K, Vines K (2006). "CODA: Convergence
Diagnosis and Output Analysis for MCMC." _R News_, *6*(1), 7-11.
<https://journal.r-project.org/archive/>.
[[4]]
R Core Team (2023). _R: A Language and Environment for Statistical
Computing_. R Foundation for Statistical Computing, Vienna, Austria.
<https://www.R-project.org/>.
[[5]]
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A
Grammar of Data Manipulation_. R package version 1.1.4,
<https://CRAN.R-project.org/package=dplyr>.
[[6]]
Wickham H (2023). _forcats: Tools for Working with Categorical
Variables (Factors)_. R package version 1.0.0,
<https://CRAN.R-project.org/package=forcats>.
[[7]]
Wickham H (2016). _ggplot2: Elegant Graphics for Data Analysis_.
Springer-Verlag New York. ISBN 978-3-319-24277-4,
<https://ggplot2.tidyverse.org>.
[[8]]
Friedman J, Tibshirani R, Hastie T (2010). "Regularization Paths for
Generalized Linear Models via Coordinate Descent." _Journal of
Statistical Software_, *33*(1), 1-22. doi:10.18637/jss.v033.i01
<https://doi.org/10.18637/jss.v033.i01>.
Simon N, Friedman J, Tibshirani R, Hastie T (2011). "Regularization
Paths for Cox's Proportional Hazards Model via Coordinate Descent."
_Journal of Statistical Software_, *39*(5), 1-13.
doi:10.18637/jss.v039.i05 <https://doi.org/10.18637/jss.v039.i05>.
Tay JK, Narasimhan B, Hastie T (2023). "Elastic Net Regularization
Paths for All Generalized Linear Models." _Journal of Statistical
Software_, *106*(1), 1-31. doi:10.18637/jss.v106.i01
<https://doi.org/10.18637/jss.v106.i01>.
[[9]]
R Core Team (2023). _R: A Language and Environment for Statistical
Computing_. R Foundation for Statistical Computing, Vienna, Austria.
<https://www.R-project.org/>.
[[10]]
R Core Team (2023). _R: A Language and Environment for Statistical
Computing_. R Foundation for Statistical Computing, Vienna, Austria.
<https://www.R-project.org/>.
[[11]]
Tibshirani J, Athey S, Sverdrup E, Wager S (2024). _grf: Generalized
Random Forests_. R package version 2.3.2,
<https://CRAN.R-project.org/package=grf>.
[[12]]
Bourgon R (2023). _intervals: Tools for Working with Points and
Intervals_. R package version 0.15.4,
<https://CRAN.R-project.org/package=intervals>.
[[13]]
Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and
Pipe Syntax_. R package version 1.4.0,
<https://CRAN.R-project.org/package=kableExtra>.
[[14]]
Xie Y (2023). _knitr: A General-Purpose Package for Dynamic Report
Generation in R_. R package version 1.45, <https://yihui.org/knitr/>.
Xie Y (2015). _Dynamic Documents with R and knitr_, 2nd edition.
Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963,
<https://yihui.org/knitr/>.
Xie Y (2014). "knitr: A Comprehensive Tool for Reproducible Research in
R." In Stodden V, Leisch F, Peng RD (eds.), _Implementing Reproducible
Computational Research_. Chapman and Hall/CRC. ISBN 978-1466561595.
[[15]]
Grolemund G, Wickham H (2011). "Dates and Times Made Easy with
lubridate." _Journal of Statistical Software_, *40*(3), 1-25.
<https://www.jstatsoft.org/v40/i03/>.
[[16]]
Bache S, Wickham H (2022). _magrittr: A Forward-Pipe Operator for R_. R
package version 2.0.3, <https://CRAN.R-project.org/package=magrittr>.
[[17]]
Venables WN, Ripley BD (2002). _Modern Applied Statistics with S_,
Fourth edition. Springer, New York. ISBN 0-387-95457-0,
<https://www.stats.ox.ac.uk/pub/MASS4/>.
[[18]]
Bates D, Maechler M, Jagan M (2023). _Matrix: Sparse and Dense Matrix
Classes and Methods_. R package version 1.6-1.1,
<https://CRAN.R-project.org/package=Matrix>.
[[19]]
R Core Team (2023). _R: A Language and Environment for Statistical
Computing_. R Foundation for Statistical Computing, Vienna, Austria.
<https://www.R-project.org/>.
[[20]]
R Core Team (2023). _R: A Language and Environment for Statistical
Computing_. R Foundation for Statistical Computing, Vienna, Austria.
<https://www.R-project.org/>.
[[21]]
Wickham H, Henry L (2023). _purrr: Functional Programming Tools_. R
package version 1.0.2, <https://CRAN.R-project.org/package=purrr>.
[[22]]
Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text
Data_. R package version 2.1.5,
<https://CRAN.R-project.org/package=readr>.
[[23]]
Barnier J (2022). _rmdformats: HTML Output Formats and Templates for
'rmarkdown' Documents_. R package version 1.0.4,
<https://CRAN.R-project.org/package=rmdformats>.
[[24]]
Tibshirani R, Tibshirani R, Taylor J, Loftus J, Reid S, Markovic J
(2019). _selectiveInference: Tools for Post-Selection Inference_. R
package version 1.2.5,
<https://CRAN.R-project.org/package=selectiveInference>.
[[25]]
R Core Team (2023). _R: A Language and Environment for Statistical
Computing_. R Foundation for Statistical Computing, Vienna, Austria.
<https://www.R-project.org/>.
[[26]]
Wickham H (2023). _stringr: Simple, Consistent Wrappers for Common
String Operations_. R package version 1.5.1,
<https://CRAN.R-project.org/package=stringr>.
[[27]]
Therneau T (2023). _A Package for Survival Analysis in R_. R package
version 3.5-7, <https://CRAN.R-project.org/package=survival>.
Terry M. Therneau, Patricia M. Grambsch (2000). _Modeling Survival
Data: Extending the Cox Model_. Springer, New York. ISBN 0-387-98784-3.
[[28]]
Rich B (2023). _table1: Tables of Descriptive Statistics in HTML_. R
package version 1.4.3, <https://CRAN.R-project.org/package=table1>.
[[29]]
Müller K, Wickham H (2023). _tibble: Simple Data Frames_. R package
version 3.2.1, <https://CRAN.R-project.org/package=tibble>.
[[30]]
Wickham H, Vaughan D, Girlich M (2024). _tidyr: Tidy Messy Data_. R
package version 1.3.1, <https://CRAN.R-project.org/package=tidyr>.
[[31]]
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E,
Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi
K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the
tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
[[32]]
R Core Team (2023). _R: A Language and Environment for Statistical
Computing_. R Foundation for Statistical Computing, Vienna, Austria.
<https://www.R-project.org/>.